"This is the first book that promises to tell the deep, dark secrets of 
computer arithmetic, and it delivers in spades. It contains every trick I 
knew plus many, many more. A godsend for library developers, 
compiler writers, and lovers of elegant hacks, it deserves a spot on 
your shelf right next to Knuth."-Josh Bloch 


"When I first saw the title, I figured that the book must be either a 
cookbook for breaking into computers (unlikely) or some sort of 
compendium of little programming tricks. It's the latter, but it's 
thorough, almost encyclopedic, in its coverage." -Guy Steele 


These are the timesaving techniques relished by computer hackers: 
those devoted and persistent code developers who seek elegant and 
efficient ways to build better software. The truth is that much of the 
computer programmer's job involves a healthy mix of arithmetic and 
logic. In Hacker's Delight, veteran programmer Hank Warren shares 
the tricks he has collected from his considerable experience in the 
worlds of application and system programming. Most of these 
techniques are eminently practical, but a few are included just 
because they are interesting and unexpected. The resulting work is an 
irresistible collection that will help even the most seasoned 
programmers better their craft. 


Topics covered include: 

A broad collection of useful programming tricks 
Small algorithms for common tasks 
Power-of-2 boundaries and bounds checking 
Rearranging bits and bytes 
Integer division and division by constants 
Some elementary functions on integers 
Gray code 
Hilbert's space-filling curve 
And even formulas for prime numbers! 


This book is for anyone who wants to create efficient code. Hacker's 
Delight will help you learn to program at a higher level, well beyond 
what is generally taught in schools and training courses, and will 
advance you substantially further than is possible through ordinary 
self-study alone. 
Dedication 


To Joseph W. Gauld, my high school algebra teacher, for sparking in 
me a delight in the simple things in mathematics. 


Foreword 


When I first got a summer job at MIT's Project MAC almost 30 years 
ago, I was delighted to be able to work with the DEC PDP-10 
computer, which was more fun to program in assembly language than 
any other computer, bar none, because of its rich yet tractable set of 
instructions for performing bit tests, bit masking, field manipulation, 
and operations on integers. Though the PDP-10 has not been 
manufactured for quite some years, there remains a thriving cult of 
enthusiasts who keep old PDP-10 hardware running and who run old 
PDP-10 software—entire operating systems and their applications—by 
using personal computers to simulate the PDP-10 instruction set. They 
even write new software; there is now at least one Web site whose 
pages are served up by a simulated PDP-10. (Come on, stop laughing 
—it's no sillier than keeping antique cars running.) 


I also enjoyed, in that summer of 1972, reading a brand-new MIT 
research memo called HAKMEM, a bizarre and eclectic potpourri of 
technical trivia.[1] The subject matter ranged from electrical circuits to 
number theory, but what intrigued me most was its small catalog of 
ingenious little programming tricks. Each such gem would typically 
describe some plausible yet unusual operation on integers or bit 
strings (such as counting the 1-bits in a word) that could easily be 
programmed using either a longish fixed sequence of machine 
instructions or a loop, and then show how the same thing might be 
done much more cleverly, using just four or three or two carefully 
chosen instructions whose interactions are not at all obvious until 
explained or fathomed. For me, devouring these little programming 
nuggets was like eating peanuts, or rather bonbons—I just couldn't 
stop—and there was a certain richness to them, a certain intellectual 
depth, elegance, even poetry. 


[1] Why "HAKMEM"? Short for "hacks memo"; one 36-bit PDP-10 word 
could hold six 6-bit characters, so a lot of the names PDP-10 hackers 
worked with were limited to six characters. We were used to glancing 
at a six-character abbreviated name and instantly decoding the 
contractions. So naming the memo "HAKMEM" made sense at the time 
—at least to the hackers. 


"Surely," I thought, "there must be more of these," and indeed over the 
years I collected, and in some cases discovered, a few more. "There 
ought to be a book of them." 


I was genuinely thrilled when I saw Hank Warren's manuscript. He has 
systematically collected these little programming tricks, organized 
them thematically, and explained them clearly. While some of them 
may be described in terms of machine instructions, this is not a book 
only for assembly language programmers. The subject matter is basic 
structural relationships among integers and bit strings in a computer 
and efficient techniques for performing useful operations on them. 
These techniques are just as useful in the C or Java programming 
languages as they are in assembly language. 


Many books on algorithms and data structures teach complicated 
techniques for sorting and searching, for maintaining hash tables and 
binary trees, for dealing with records and pointers. They overlook 
what can be done with very tiny pieces of data—bits and arrays of 
bits. It is amazing what can be done with just binary addition and 
subtraction and maybe some bitwise operations; the fact that the carry 
chain allows a single bit to affect all the bits to its left makes addition 
a peculiarly powerful data manipulation operation in ways that are 
not widely appreciated. 


Yes, there ought to be a book about these techniques. Now it is in 
your hands, and it's terrific. If you write optimizing compilers or high- 
performance code, you must read this book. You otherwise might not 
use this bag of tricks every single day—but if you find yourself stuck 
in some situation where you apparently need to loop over the bits in a 
word, or to perform some operation on integers and it just seems 
harder to code than it ought, or you really need the inner loop of 
some integer or bit-fiddly computation to run twice as fast, then this is 
the place to look. Or maybe you'll just find yourself reading it straight 
through out of sheer pleasure. 


Guy L. Steele, Jr. 
Burlington, Massachusetts 
April 2002 


Preface 


Caveat Emptor: The cost of software maintenance increases with the 
square of the programmer's creativity. 


—First Law of Programmer Creativity, Robert D. Bliss, 1992 


This is a collection of small programming tricks that I have come 
across over many years. Most of them will work only on computers 
that represent integers in two's-complement form. Although a 32-bit 
machine is assumed when the register length is relevant, most of the 
tricks are easily adapted to machines with other register sizes. 


This book does not deal with large tricks such as sophisticated sorting 
and compiler optimization techniques. Rather, it deals with small 
tricks that usually involve individual computer words or instructions, 
such as counting the number of 1-bits in a word. Such tricks often use 
a mixture of arithmetic and logical instructions. 


It is assumed throughout that integer overflow interrupts have been 
masked off, so they cannot occur. C, Fortran, and even Java programs 
run in this environment, but Pascal and Ada users beware! 


The presentation is informal. Proofs are given only when the 
algorithm is not obvious, and sometimes not even then. The methods 
use computer arithmetic, "floor" functions, mixtures of arithmetic and 
logical operations, and so on. Proofs in this domain are often difficult 
and awkward to express. 


To reduce typographical errors and oversights, many of the algorithms 
have been executed. This is why they are given in a real programming 
language, even though, like every computer language, it has some 
ugly features. C is used for the high-level language because it is widely 
known, it allows the straightforward mixture of integer and bit-string 
operations, and C compilers that produce high-quality object code are 
available. 


Occasionally, machine language is used. It employs a three-address 
format, mainly for ease of readability. The assembly language used is 
that of a fictitious machine that is representative of today's RISC 
computers. 


Branch-free code is favored. This is because on many computers, 
branches slow down instruction fetching and inhibit executing 
instructions in parallel. Another problem with branches is that they 
may inhibit compiler optimizations such as instruction scheduling, 
commoning, and register allocation. That is, the compiler may be 
more effective at these optimizations with a program that consists of a 
few large basic blocks rather than many small ones. 


The code sequences also tend to favor small immediate values, 
comparisons to zero (rather than to some other number), and 
instruction-level parallelism. Although much of the code would 
become more concise by using table lookups (from memory), this is 
not often mentioned. This is because loads are becoming more 
expensive relative to arithmetic instructions, and the table lookup 
methods are often not very interesting (although they are often 
practical). But there are exceptional cases. 


Finally, I should mention that the term "hacker" in the title is meant in 
the original sense of an aficionado of computers—someone who 
enjoys making computers do new things, or do old things in a new 
and clever way. The hacker is usually quite good at his craft, but may 
very well not be a professional computer programmer or designer. The 
hacker's work may be useful or may be just a game. As an example of 
the latter, more than one determined hacker has written a program 
which, when executed, writes out an exact copy of itself. [1] This is the 
sense in which we use the term "hacker." If you're looking for tips on 
how to break into someone else's computer, you won't find them here. 


[1] The shortest such program written in C, known to the present 
author, is by Vlad Taeerov and Rashit Fakhreyev and is 64 characters 
in length: 


main(a){printf(a,34,a="main(a){printf(a,34,a=%c%s%c,34);}",34);} 
Chapter 1. Introduction 


Notation 


Instruction Set and Execution Time Model 


1-1 Notation 


This book distinguishes between mathematical expressions of ordinary 
arithmetic and those that describe the operation of a computer. In 
"computer arithmetic," operands are bit strings, or bit vectors, of some 
definite fixed length. Expressions in computer arithmetic are similar to 
those of ordinary arithmetic, but the variables denote the contents of 
computer registers. The value of a computer arithmetic expression is 
simply a string of bits with no particular interpretation. An operator, 
however, interprets its operands in some particular way. For example, 
a comparison operator might interpret its operands as signed binary 
integers or as unsigned binary integers; our computer arithmetic 
notation uses distinct symbols to make the type of comparison clear. 


The main difference between computer arithmetic and ordinary 
arithmetic is that in computer arithmetic, the results of addition, 
subtraction, and multiplication are reduced modulo 2^n, where n is the 
word size of the machine. Another difference is that computer 
arithmetic includes a large number of operations. In addition to the 
four basic arithmetic operations, computer arithmetic includes logical 
and, exclusive or, compare, shift left, and so on. 


Unless specified otherwise, the word size is 32 bits, and signed 
integers are represented in two's-complement form. 


Expressions of computer arithmetic are written similarly to those of 
ordinary arithmetic, except that the variables that denote the contents 
of computer registers are in bold-face type. This convention is 
commonly used in vector algebra. We regard a computer word as a 
vector of single bits. Constants also appear in bold-face type when 
they denote the contents of a computer register. (This has no analogy 
with vector algebra because in vector algebra the only way to write a 
constant is to display the vector's components.) When a constant 
denotes part of an instruction, such as the immediate field of a shift 
instruction, light-face type is used. 


If an operator such as "+" has bold-face operands, then that operator 
denotes the computer's addition operation ("vector addition"). If the 
operands are light-faced, then the operator denotes the ordinary scalar 
arithmetic operation. We use a light-faced variable x to denote the 
arithmetic value of a bold-faced variable x under an interpretation 
(signed or unsigned) that should be clear from the context. Thus, if the 
registers x and y both contain 0x80000000, then, under signed integer 
interpretation, their arithmetic values are x = y = -2^31, the ordinary 
sum is x + y = -2^32, and the computer sum (x + y with bold-face 
operands) is 0. Here, 0x80000000 is hexadecimal notation for a bit 
string consisting of a 1-bit followed by 31 0-bits. 
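
The contrast between the two kinds of sum can be sketched in C. This is only an illustration under stated assumptions: the function names are invented here, the types from <stdint.h> stand in for a 32-bit register, and the conversion of 0x80000000 to int32_t is assumed to yield -2^31 (true on two's-complement implementations, but implementation-defined in older C standards).

```c
#include <stdint.h>

/* Computer addition: the sum of two 32-bit words, reduced modulo 2^32.
   The result converted to uint32_t wraps around in exactly this way. */
static uint32_t computer_add(uint32_t x, uint32_t y) {
    return (uint32_t)(x + y);
}

/* Ordinary addition of the signed values of two words, carried out in
   64 bits so that the true arithmetic sum is representable.
   Assumes the uint32_t -> int32_t conversion is two's-complement. */
static int64_t ordinary_add_signed(uint32_t x, uint32_t y) {
    return (int64_t)(int32_t)x + (int64_t)(int32_t)y;
}
```

With both operands equal to 0x80000000, computer_add yields 0, while ordinary_add_signed yields -2^32.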


Bits are numbered from the right, with the rightmost (least significant) 
bit being bit 0. The terms "bits," "nibbles," "bytes," "halfwords," 
"words," and "doublewords" refer to lengths of 1, 4, 8, 16, 32, and 64 
bits, respectively. 


Short and simple sections of code are written in computer algebra, 
using its assignment operator (left arrow) and occasionally an if 
statement. In this role, computer algebra is serving as little more than 
a machine-independent way of writing assembly language code. 


Longer or more complex computer programs are written in the C++ 
programming language. None of the object-oriented features of C++ 
are used; the programs are basically in C with comments in C++ style. 
When the distinction is unimportant, the language is referred to 
simply as "C." 


A complete description of C would be out of place in this book, but 
Table 1-1 contains a brief summary of most of the elements of C 
[H&S] that are used herein. This is provided for the benefit of the 
reader who is familiar with some procedural programming language 
but not with C. Table 1-1 also shows the operators of our computer- 
algebraic arithmetic language. Operators are listed from highest 
precedence (tightest binding) to lowest. In the Precedence column, L 
means left-associative; that is, 


a • b • c = (a • b) • c 


and R means right-associative. Our computer-algebraic notation 
follows C in precedence and associativity. 


In addition to the notations described in Table 1-1, those of Boolean 
algebra and of standard mathematics are used, with explanations 
where necessary. 


Table 1-1. Expressions of C and Computer Algebra 
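Precedence  C                  Computer Algebra     Description
16          0x...              0x..., 0b...         Hexadecimal, binary constants
16          a[k]                                    Selecting the kth component
16                             x0, x1, ...          Different variables, or bit selection (clarified in text)
16          f(x, ...)          f(x, ...)            Function evaluation
16                             abs(x)               Absolute value (but abs(-2^31) = -2^31)
16                             nabs(x)              Negative of the absolute value
15          x++, x--                                Postincrement, decrement
14          ++x, --x                                Preincrement, decrement
14          (type name)x                            Type conversion
14 R                           x^k                  x to the kth power
14          ~x                 ¬x                   Bitwise not (one's-complement)
14          !x                                      Logical not (if x = 0 then 1 else 0)
14          -x                 -x                   Arithmetic negation
13 L        x*y                x*y                  Multiplication, modulo word size
13 L        x/y                x ÷ y                Signed integer division
13 L        x/y                x ÷u y               Unsigned integer division
13 L        x%y                rem(x, y)            Remainder (may be negative) of (x ÷ y); signed arguments
13 L        x%y                remu(x, y)           Remainder of x ÷u y; unsigned arguments
                               mod(x, y)            x reduced modulo y to the interval [0, abs(y) - 1]; signed arguments
12 L        x + y, x - y       x + y, x - y         Addition, subtraction
11 L        x << y, x >> y     x << y, x >>u y      Shift left, right with 0-fill ("logical" shifts)
11 L        x >> y             x >>s y              Shift right with sign-fill ("arithmetic" or "algebraic" shift)
11 L                           x rotl y, x rotr y   Rotate shift left, right
10 L        x < y, x <= y,     x < y, x ≤ y,        Signed comparison
            x > y, x >= y      x > y, x ≥ y
10 L        x < y, x <= y,     x <u y, x ≤u y,      Unsigned comparison
            x > y, x >= y      x >u y, x ≥u y
9 L         x == y, x != y     x = y, x ≠ y         Equality, inequality
8 L         x & y              x & y                Bitwise and
7 L         x ^ y              x ⊕ y                Bitwise exclusive or
7 L                            x ≡ y                Bitwise equivalence (¬(x ⊕ y))
6 L         x | y              x | y                Bitwise or
5 L         x && y             x && y               Conditional and (if x = 0 then 0 else if y = 0 then 0 else 1)
4 L         x || y             x || y               Conditional or (if x ≠ 0 then 1 else if y ≠ 0 then 1 else 0)
3 L                            x || y               Concatenation
2 R         x = y              x ← y                Assignment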

Our computer algebra uses other functions, in addition to "abs," "rem," 
and so on. These are defined where introduced. 


In C, the expression x < y < z means to evaluate x < y to a 0/1-valued 
result, and then compare that result to z. In computer algebra, 
the expression x < y < z means (x < y) & (y < z). 
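
The difference between the two readings can be seen in a small C sketch. The function names are invented for this illustration; the first function writes out the grouping that C applies to x < y < z (the explicit parentheses only avoid compiler warnings and do not change the meaning).

```c
/* What C computes for x < y < z: the 0/1 result of x < y is
   compared with z. */
static int c_chained(int x, int y, int z) {
    return (x < y) < z;
}

/* What the computer-algebra expression x < y < z means. */
static int algebra_chained(int x, int y, int z) {
    return (x < y) & (y < z);
}
```

For x = 3, y = 2, z = 1, the C reading gives 1 (because 0 < 1), whereas the computer-algebra reading gives 0.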


C has three loop control statements: while, do, and for. The 
while statement is written: 


while (expression) statement 


First, expression is evaluated. If true (nonzero), statement is executed 
and control returns to evaluate expression again. If expression is false 
(0), the while-loop terminates. 


The do statement is similar, except the test is at the bottom of the 
loop. It is written: 


do statement while (expression) 


First, statement is executed, and then expression is evaluated. If true, 
the process is repeated, and if false, the loop terminates. 


The for statement is written: 
for (e1; e2; e3) statement 


First, e1, usually an assignment statement, is executed. Then e2, 
usually a comparison, is evaluated. If false, the for-loop terminates. If 
true, statement is executed. Finally, e3, usually an assignment 
statement, is executed, and control returns to evaluate e2 again. Thus, 
the familiar "do i = 1 to n" is written: 


for (i = 1; i <= n; i++) 


(This is one of the few contexts in which we use the postincrement 
operator.) 
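
As a brief illustration of the three loop forms, the sketch below sums the integers 1 through n in each style. The function names are chosen here for the example only. Note that the do version executes its body once even when n is 0, because its test is at the bottom of the loop.

```c
/* Sum 1..n with a while loop; the test comes first, so n = 0 gives 0. */
static int sum_while(int n) {
    int i = 1, s = 0;
    while (i <= n) { s += i; i++; }
    return s;
}

/* Sum 1..n with a do loop; the body runs before the first test,
   so n = 0 gives 1 rather than 0. */
static int sum_do(int n) {
    int i = 1, s = 0;
    do { s += i; i++; } while (i <= n);
    return s;
}

/* Sum 1..n with a for loop; equivalent to the while version. */
static int sum_for(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++) s += i;
    return s;
}
```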


1-2 Instruction Set and Execution Time Model 


To permit a rough comparison of algorithms, we imagine them being 
coded for a machine with an instruction set similar to that of today's 
general purpose RISC computers, such as the Compaq Alpha, the SGI 
MIPS, and the IBM RS/6000. The machine is three-address and has a 
fairly large number of general purpose registers—that is, 16 or more. 
Unless otherwise specified, the registers are 32 bits long. General 
register 0 contains a permanent 0, and the others can be used 
uniformly for any purpose. 


In the interest of simplicity there are no "special purpose" registers, 
such as a condition register or a register to hold status bits, such as 
"overflow." No floating-point operations are described, because that is 
beyond the scope of this book. 


We recognize two varieties of RISC: a "basic RISC," having the 
instructions shown in Table 1-2, and a "full RISC," having all the 
instructions of the basic RISC plus those shown in Table 1-3. 


Table 1-2. Basic RISC Instruction Set 
Opcode Mnemonic    Operands    Description

add, sub, mul, div, divu, rem, remu    RT,RA,RB
    RT ← RA op RB, where op is add, subtract, multiply, divide signed, 
    divide unsigned, remainder signed, or remainder unsigned. 

addi, muli    RT,RA,I
    RT ← RA op I, where op is add or multiply, and I is a 16-bit 
    signed immediate value. 

addis    RT,RA,I
    RT ← RA + (I || 0x0000). 

and, or, xor    RT,RA,RB
    RT ← RA op RB, where op is bitwise and, or, or exclusive or. 

andi, ori, xori    RT,RA,Iu
    As above, except the last operand is a 16-bit unsigned immediate 
    value. 

beq, bne, blt, ble, bgt, bge    RT,target
    Branch to target if RT = 0, or if RT ≠ 0, or if RT < 0, or if RT ≤ 
    0, or if RT > 0, or if RT ≥ 0 (signed integer interpretation of RT). 

bt, bf    RT,target
    Branch true/false; same as bne/beq resp. 

cmpeq, cmpne, cmplt, cmple, cmpgt, cmpge, cmpltu, cmpleu, cmpgtu, cmpgeu    RT,RA,RB
    RT gets the result of comparing RA with RB; 0 if false and 1 if true. 
    Mnemonics denote compare for equality, inequality, less than, and so 
    on, as for the branch instructions, and in addition, the suffix "u" 
    denotes an unsigned comparison. 

cmpieq, cmpine, cmpilt, cmpile, cmpigt, cmpige    RT,RA,I
    Like cmpeq and so on, except the second comparand is a 16-bit 
    signed immediate value. 

cmpiequ, cmpineu, cmpiltu, cmpileu, cmpigtu, cmpigeu    RT,RA,Iu
    Like cmpltu and so on, except the second comparand is a 16-bit 
    unsigned immediate value. 

ldbu, ldh, ldhu, ldw    RT,d(RA)
    Load an unsigned byte, signed halfword, unsigned halfword, or word 
    into RT from memory at location RA + d, where d is a 16-bit 
    signed immediate value. 

mulhs, mulhu    RT,RA,RB
    RT gets the high-order 32 bits of the product of RA and RB; signed 
    and unsigned. 

not    RT,RA
    RT ← bitwise one's-complement of RA. 

shl, shr, shrs    RT,RA,RB
    RT ← RA shifted left or right by the amount given in the rightmost six 
    bits of RB; 0-fill except for shrs, which is sign-fill. (The shift 
    amount is treated modulo 64.) 

shli, shri, shrsi    RT,RA,Iu
    RT ← RA shifted left or right by the amount given in the 5-bit 
    immediate field. 

stb, sth, stw    RS,d(RA)
    Store a byte, halfword, or word, from RS into memory at location RA 
    + d, where d is a 16-bit signed immediate value. 


In these brief instruction descriptions, RA and RB appearing as source 
operands really mean the contents of those registers. 


A real machine would have branch and link (for subroutine calls), 
branch to the address contained in a register (for subroutine returns 
and "switches"), and possibly some instructions for dealing with 
special purpose registers. It would, of course, have a number of 
privileged instructions and instructions for calling on supervisor 
services. It might also have floating-point instructions. 


Some other computational instructions that a RISC computer might 
have are identified in Table 1-3. These are discussed in later chapters. 


Table 1-3. Additional Instructions for the "Full RISC" 
Opcode Mnemonic    Operands    Description

abs, nabs    RT,RA
    RT gets the absolute value, or the negative of the absolute value, of 
    RA. 

andc, eqv, nand, nor, orc    RT,RA,RB
    Bitwise and with complement (of RB), equivalence, negative and, 
    negative or, and or with complement. 

extr    RT,RA,I,L
    Extract bits I through I+L-1 of RA, and place them right-adjusted 
    in RT, with 0-fill. 

extrs    RT,RA,I,L
    Like extr, but sign-fill. 

ins    RT,RA,I,L
    Insert bits 0 through L-1 of RA into bits I through I+L-1 of RT. 

nlz    RT,RA
    RT gets the number of leading 0's in RA (0 to 32). 

pop    RT,RA
    RT gets the number of 1-bits in RA (0 to 32). 

ldb    RT,d(RA)
    Load a signed byte into RT from memory at location RA + d, where 
    d is a 16-bit signed immediate value. 

moveq, movne, movlt, movle, movgt, movge    RT,RA,RB
    RT ← RB if RA = 0, or if RA ≠ 0, and so on, else RT is unchanged. 

shlr, shrr    RT,RA,RB
    RT ← RA rotate-shifted left or right by the amount given in the 
    rightmost five bits of RB. 

shlri, shrri    RT,RA,Iu
    RT ← RA rotate-shifted left or right by the amount given in the 5-bit 
    immediate field. 

trpeq, trpne, trplt, trple, trpgt, trpge, trpltu, trpleu, trpgtu, trpgeu    RA,RB
    Trap (interrupt) if RA = RB, or RA ≠ RB, and so on. 

trpieq, trpine, trpilt, trpile, trpigt, trpige    RA,I
    Like trpeq and so on, except the second comparand is a 16-bit 
    signed immediate value. 

trpiequ, trpineu, trpiltu, trpileu, trpigtu, trpigeu    RA,Iu
    Like trpltu and so on, except the second comparand is a 16-bit 
    unsigned immediate value. 


It is convenient to provide the machine's assembler with a few 
"extended mnemonics." These are like macros whose expansion is 
usually a single instruction. Some possibilities are shown in Table 1-4. 


Table 1-4. Extended Mnemonics 
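Extended Mnemonic    Expansion         Description
b target             beq R0,target     Unconditional branch. 
li RT,I              See text          Load immediate, -2^31 ≤ I < 2^32. 
mov RT,RA            ori RT,RA,0       Move register RA to RT. 
neg RT,RA            sub RT,R0,RA      Negate (two's-complement). 
subi RT,RA,I         addi RT,RA,-I     Subtract immediate (I ≠ -2^15). 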


The load immediate instruction expands into one or two instructions, 
as required by the immediate value I. For example, if 0 ≤ I < 2^16, an 
or immediate (ori) from R0 can be used. If -2^15 ≤ I < 0, an add 
immediate (addi) from R0 can be used. If the rightmost 16 bits of I 
are 0, add immediate shifted (addis) can be used. Otherwise, two 
instructions are required, such as addis followed by ori. 
(Alternatively, in the last case a load from memory could be used, but 
for execution time and space estimates we assume that two 
elementary arithmetic instructions are used.) 
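
The case analysis for load immediate can be written down as a small C sketch. The function name li_cost is hypothetical and exists only for this illustration; it returns the number of elementary instructions counted for a given immediate value, assuming the value lies in the range accepted by li.

```c
#include <stdint.h>

/* Number of elementary instructions needed to load the immediate
   value imm on the model machine, following the cases in the text.
   Assumes -2^31 <= imm < 2^32. The name li_cost is hypothetical. */
static int li_cost(int64_t imm) {
    if (imm >= 0 && imm < 65536)
        return 1;                     /* ori from R0 */
    if (imm >= -32768 && imm < 0)
        return 1;                     /* addi from R0 */
    if (((uint64_t)imm & 0xFFFFu) == 0)
        return 1;                     /* addis: rightmost 16 bits are 0 */
    return 2;                         /* e.g., addis followed by ori */
}
```

For example, li_cost(0x12345678) is 2, while li_cost(0x7FFF0000) is 1.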


Of course, which instructions belong in the basic RISC, and which 
belong in the full RISC is very much a matter of judgment. Quite 
possibly, divide unsigned and the remainder instructions should be 
moved to the full RISC category. Shift right signed is another 
suspicious instruction, given its low frequency of use in the SPEC 
benchmarks. The trouble is, in C it is easy to accidentally use these 
instructions, by doing a division with unsigned operands when they 
could just as well be signed, and by doing a shift right with a signed 
quantity (int) that could just as well be unsigned. Incidentally, shift 
right signed (or shift right arithmetic, as it is often called) does not do 
a division of a signed integer by a power of 2; you need to add 1 to 
the result if the dividend is negative and any nonzero bits are shifted 
out. 
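
The remark about shift right signed can be checked with a short C sketch. It assumes that >> applied to a negative int is an arithmetic (sign-filling) shift; C leaves this implementation-defined, although common compilers behave this way. The function names are invented for the example, and k is assumed to be in the range 0 to 30.

```c
/* Plain shift right signed. Assumes >> on a negative int sign-fills
   (implementation-defined in C). */
static int shr_signed(int x, int k) {
    return x >> k;
}

/* Division of x by 2^k, truncating toward zero as C's / does:
   shift, then add 1 if the dividend is negative and any nonzero
   bits were shifted out. Assumes 0 <= k <= 30. */
static int div_pow2(int x, int k) {
    int q = x >> k;
    int shifted_out = x & ((1 << k) - 1);
    if (x < 0 && shifted_out != 0)
        q += 1;
    return q;
}
```

For instance, under the stated assumption shr_signed(-7, 1) is -4, whereas -7 / 2 in C is -3, which is what div_pow2(-7, 1) returns.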


The distinction between basic and full RISC involves many other such 
questionable judgments, but we won't dwell on them. 


The instructions are limited to two source registers and one target, 
which simplifies the computer (e.g., the register file requires no more 
than two read ports and one write port). It also simplifies an 
optimizing compiler, because the compiler does not need to deal with 
instructions that have multiple targets. The price paid for this is that a 
program that wants both the quotient and remainder of two numbers 
(not uncommon) must execute two instructions (divide and 
remainder). The usual machine division algorithm produces the 
remainder as a by-product, so many machines make them both 
available as a result of one execution of divide. Similar remarks apply 
to obtaining the doubleword product of two words. 


The conditional move instructions (e.g., moveq) ostensibly have only 
two source operands, but in a sense they have three. Because the 
result of the instruction depends on the values in RT, RA, and RB, a 
machine that executes instructions out of order must treat RT in these 
instructions as both a use and a set. That is, an instruction that sets 
RT, followed by a conditional move that sets RT, must be executed in 
that order, and the result of the first instruction cannot be discarded. 
Thus, the designer of such a machine may elect to omit the 
conditional move instructions to avoid having to consider an 
instruction with (logically) three source operands. On the other hand, 
the conditional move instructions do save branches. 


Instruction formats are not relevant to the purposes of this book, but 
the full RISC instruction set described above, with floating point and a 
few supervisory instructions added, can be implemented with 32-bit 
instructions on a machine with 32 general purpose registers (5-bit 
register fields). By reducing the immediate fields of compare, load, 
store, and trap instructions to 14 bits, the same holds for a machine 
with 64 general purpose registers (6-bit register fields). 


Execution Time 


We assume that all instructions execute in one cycle, except for the 
multiply, divide, and remainder instructions, for which we do not 
assume any particular execution time. Branches take one cycle 
whether they branch or fall through. 


The load immediate instruction is counted as one or two cycles, 
depending on whether one or two elementary arithmetic instructions 
are required to generate the constant in a register. 


Although load and store instructions are not often used in this book, 
we assume they take one cycle and ignore any load delay (time lapse 
between when a load instruction completes in the arithmetic unit, and 
when the requested data is available for a subsequent instruction). 


However, knowing the number of cycles used by all the arithmetic 
and logical instructions is often insufficient for estimating the 
execution time of a program. Execution can be slowed substantially by 
load delays and by delays in fetching instructions. These delays, 
although very important and increasing in importance, are not 
discussed in this book. Another factor, one which improves execution 
time, is what is called "instruction-level parallelism," which is found in 
many contemporary RISC chips, particularly those for "high-end" 
machines. 


These machines have multiple execution units and sufficient 
instruction-dispatching capability to execute instructions in parallel 
when they are independent (that is, when neither uses a result of the 
other, and they don't both set the same register or status bit). Because 
this capability is now quite common, the presence of independent 
operations is often pointed out in this book. Thus, we might say that 
such and such a formula can be coded in such a way that it requires 
eight instructions and executes in five cycles on a machine with 
unlimited instruction-level parallelism. This means that if the 
instructions are arranged in the proper order ("scheduled"), a machine 
with a sufficient number of adders, shifters, logical units, and registers 
can in principle execute the code in five cycles. 


We do not make too much of this, because machines differ greatly in 
their instruction-level parallelism capabilities. For example, an IBM 
RS/6000 processor from ca. 1992 has a three-input adder, and can 
execute two consecutive add-type instructions in parallel even when 
one feeds the other (e.g., an add feeding a compare, or the base 
register of a load). As a contrary example, consider a simple computer, 
possibly for low-cost embedded applications, that has only one read 
port on its register file. Normally, this machine would take an extra 
cycle to do a second read of the register file for an instruction that has 
two register input operands. However, suppose it has a bypass so that 
if an instruction feeds an operand of the immediately following 
instruction, then that operand is available without reading the register 
file. On such a machine, it is actually advantageous if each instruction 
feeds the next—that is, if the code has no parallelism. 


Chapter 2. Basics 


Manipulating Rightmost Bits 

Addition Combined with Logical Operations 
Inequalities among Logical and Arithmetic Expressions 
Absolute Value Function 

Sign Extension 

Shift Right Signed from Unsigned 

Sign Function 

Three-Valued Compare Function 

Transfer of Sign 

Decoding a "Zero Means 2**n" Field 

Comparison Predicates 

Overflow Detection 

Condition Code Result of Add, Subtract, and Multiply 
Rotate Shifts 

Double-Length Add/Subtract 

Double-Length Shifts 

Multibyte Add, Subtract, Absolute Value 

Doz, Max, Min 

Exchanging Registers 


Alternating among Two or More Values 


2-1 Manipulating Rightmost Bits 


Some of the formulas in this section find application in later chapters. 


Use the following formula to turn off the rightmost 1-bit in a word, 
producing 0 if none (e.g., 01011000 ⇒ 01010000): 


x & (x - 1) 


This may be used to determine if an unsigned integer is a power of 2; 
apply the formula followed by a 0-test on the result. 


Similarly, the following formula can be used to test if an unsigned 
integer is of the form 2^n - 1 (including 0 or all 1's): 


x & (x + 1) 
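
These two formulas translate directly into C. The sketch below uses uint32_t for a 32-bit word; the function names are invented for this illustration. Note that the 0-test on x & (x - 1) also succeeds for x = 0, so the power-of-2 predicate is named accordingly.

```c
#include <stdint.h>

/* Turn off the rightmost 1-bit of x (gives 0 if x is 0). */
static uint32_t turn_off_rightmost_one(uint32_t x) {
    return x & (x - 1);
}

/* Nonzero if x is a power of 2 or x is 0. */
static int is_pow2_or_zero(uint32_t x) {
    return (x & (x - 1)) == 0;
}

/* Nonzero if x has the form 2^n - 1, including 0 and all 1's. */
static int is_low_ones_mask(uint32_t x) {
    return (x & (x + 1)) == 0;
}
```

A caller that needs a strict power-of-2 test can additionally require x != 0.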


Use the following formula to isolate the rightmost 1-bit, producing 0 if 
none (e.g., 01011000 ⇒ 00001000): 


x & (-x) 


Use the following formula to isolate the rightmost 0-bit, producing 0 if 
none (e.g., 10100111 ⇒ 00001000): 


¬x & (x + 1) 


Use one of the following formulas to form a mask that identifies the 
trailing 0's, producing all 1's if x = 0 (e.g., 01011000 ⇒ 00000111): 


¬x & (x - 1), or 
¬(x | -x), or 
(x & -x) - 1 


The first formula has some instruction-level parallelism. 
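
A C sketch of the isolation and trailing-zero formulas follows. The function names are invented for the example; C's ~ plays the role of bitwise not, and negating a uint32_t is well defined (it is reduced modulo 2^32).

```c
#include <stdint.h>

/* Isolate the rightmost 1-bit (0 if none). */
static uint32_t isolate_rightmost_one(uint32_t x) {
    return x & (-x);
}

/* Isolate the rightmost 0-bit (0 if none). */
static uint32_t isolate_rightmost_zero(uint32_t x) {
    return ~x & (x + 1);
}

/* Three equivalent masks for the trailing 0's (all 1's if x is 0). */
static uint32_t trailing_zeros_mask_a(uint32_t x) {
    return ~x & (x - 1);
}
static uint32_t trailing_zeros_mask_b(uint32_t x) {
    return ~(x | -x);
}
static uint32_t trailing_zeros_mask_c(uint32_t x) {
    return (x & -x) - 1;
}
```

In the first mask formula, ~x and x - 1 do not depend on each other, which is the instruction-level parallelism referred to above.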


Use the following formula to form a mask that identifies the rightmost 
1-bit and the trailing 0's, producing all 1's if x = 0 (e.g., 01011000 
⇒ 00001111): 


x ⊕ (x - 1) 


Use the following formula to right-propagate the rightmost 1-bit, 
producing all 1's if x = 0 (e.g., 01011000 ⇒ 01011111): 


x | (x - 1) 


Use the following formula to turn off the rightmost contiguous string 
of 1-bits (e.g., 01011000 ⇒ 01000000): 


((x | (x - 1)) + 1) & x 


This may be used to see if a nonnegative integer is of the form 2^j - 2^k 
for some j ≥ k ≥ 0; apply the formula followed by a 0-test of the 
result. 
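
The last three formulas can be sketched in C as follows. The function names are invented for this illustration, and uint32_t stands in for a 32-bit word.

```c
#include <stdint.h>

/* Mask covering the rightmost 1-bit and the trailing 0's
   (all 1's if x is 0). */
static uint32_t mask_through_rightmost_one(uint32_t x) {
    return x ^ (x - 1);
}

/* Right-propagate the rightmost 1-bit (all 1's if x is 0). */
static uint32_t propagate_rightmost_one(uint32_t x) {
    return x | (x - 1);
}

/* Turn off the rightmost contiguous string of 1-bits. */
static uint32_t turn_off_rightmost_run(uint32_t x) {
    return ((x | (x - 1)) + 1) & x;
}

/* Nonzero if x has the form 2^j - 2^k for some j >= k >= 0,
   that is, if its 1-bits form at most one contiguous run. */
static int is_pow2_difference(uint32_t x) {
    return turn_off_rightmost_run(x) == 0;
}
```

For example, 0x38 = 2^6 - 2^3 passes the test, while 0x58 does not.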


These formulas all have duals in the following sense. Read what the 
formula does, interchanging 1's and 0's in the description. Then, in the 
formula, replace x - 1 with x + 1, x + 1 with x - 1, -x with ¬(x + 1), 
& with |, and | with &. Leave x and ¬x alone. Then the result is a valid 
description and formula. For example, the dual of the first formula in 
this section reads as follows: 


Use the following formula to turn on the rightmost 0-bit in a word, 
producing all 1's if none (e.g., 10100111 ⇒ 10101111): 


x | (x + 1) 
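
The dual pair can be sketched in C as shown below. The function names are invented for the example. One way to build intuition for the duality, offered here as an observation rather than something the text states, is that complementing both the input and the output of a formula yields its dual.

```c
#include <stdint.h>

/* First formula of the section: turn off the rightmost 1-bit. */
static uint32_t turn_off_rightmost_one(uint32_t x) {
    return x & (x - 1);
}

/* Its dual: turn on the rightmost 0-bit (all 1's if none). */
static uint32_t turn_on_rightmost_zero(uint32_t x) {
    return x | (x + 1);
}

/* The dual obtained by complementing input and output of the
   original formula; it should agree with turn_on_rightmost_zero. */
static uint32_t dual_by_complement(uint32_t x) {
    return (uint32_t)~turn_off_rightmost_one((uint32_t)~x);
}
```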


There is a simple test to determine whether or not a given function 
can be implemented with a sequence of add's, subtract's, and's, or's, 
and not's [War]. We may, of course, expand the list with other 
instructions that can be composed from the basic list, such as shift left 
by a fixed amount (which is equivalent to a sequence of add's), or 
multiply. However, we exclude instructions that cannot be composed 
from the list. The test is contained in the following theorem. 


Theorem. A function mapping words to words can be implemented 
with word-parallel add, subtract, and, or, and not instructions if and 
only if each bit of the result depends only on bits at and to the right of 
the corresponding position in each input operand. 


That is, imagine trying to compute the rightmost bit of the result by 
looking only at the rightmost bit of each input operand. Then, try to 
compute the next bit to the left by looking only at the rightmost two 
bits of each input operand, and so forth. If you are successful in this, 
then the function can be computed with a sequence of add's, and's, 
and so on. If the function cannot be computed in this right-to-left 
manner, then it cannot be implemented with a sequence of such 
instructions. 


The interesting part of this is the latter statement, and it is simply the 
contrapositive of the observation that the functions add, subtract, and, 
or, and not can all be computed in the right-to-left manner, so any 
combination of them must have this property. 


To see the "if" part of the theorem, we need a construction that is a 
little awkward to explain. We illustrate it with a specific example. 
Suppose that a function of two variables x and y has the right-to-left 
computability property, and suppose that bit 2 of the result r is given 
by 


Equation 1 


r_2 = x_2 | (x_0 & y_1). 


We number bits from right to left, 0 to 31. Because bit 2 of the result 
is a function of bits at and to the right of bit 2 of the input operands, 
bit 2 of the result is "right-to-left computable." 


Arrange the computer words x, x shifted left two, and y shifted left 
one, as shown below. Also, add a mask that isolates bit 2. 


x_31 x_30 ... x_3 x_2 x_1 x_0 
x_29 x_28 ... x_1 x_0 0   0 
y_30 y_29 ... y_2 y_1 y_0 0 
0    0    ... 0   1   0   0 
0    0    ... 0   r_2 0   0 


Now, form the word-parallel and of lines 2 and 3, or the result with 
row 1 (following Equation (1)), and and the result with the mask (row 
4 above). The result is a word of all 0's except for the desired result bit 
in position 2. Perform similar computations for the other bits of the 
result, or the 32 resulting words together, and the result is the desired 
function. 


This construction does not yield an efficient program; rather, it merely 
shows that it can be done with instructions in the basic list. 


Using the theorem, we immediately see that there is no sequence of 
such instructions that turns off the leftmost 1-bit in a word, because to 
see if a certain 1-bit should be turned off, we must look to the left to 
see if it is the leftmost one. Similarly, there can be no such sequence 
for performing a right shift, or a rotate shift, or a left shift by a 
variable amount, or for counting the number of trailing 0's in a word 
(to count trailing 0's, the rightmost bit of the result will be 1 if there 
are an odd number of trailing 0's, and we must look to the left of the 
rightmost position to determine that). 


A novel application of the sort of bit twiddling discussed above is the 
problem of finding the next higher number after a given number that 
has the same number of 1-bits. You are forgiven if you are asking, 
"Why on earth would anyone want to compute that?" It has 
application where bit strings are used to represent subsets. The 
possible members of a set are listed in a linear array, and a subset is 
represented by a word or sequence of words in which bit i is on if 
member i is in the subset. Set unions are computed by the logical or of 
the bit strings, intersections by and's, and so on. 


You might want to iterate through all the subsets of a given size. This 
is easily done if you have a function that maps a given subset to the 
next higher number (interpreting the subset string as an integer) with 


the same number of 1-bits. 


A concise algorithm for this operation was devised by R. W. Gosper 
[HAK, item 175]. [1] Given a word x that represents a subset, the idea 
is to find the rightmost contiguous group of 1's in x and the following 
0's, and "increment" that quantity to the next value that has the same 
number of 1's. For example, the string xxx0 1111 0000, where xxx 
represents arbitrary bits, becomes xxx1 0000 0111. The algorithm first 
identifies the "smallest" 1-bit in x, with s = x & -x, giving 
000000010000. This is added to x, giving r = xxx100000000. The 1- 
bit here is one bit of the result. For the other bits, we need to produce 
a right-adjusted string of n - 1 1's, where n is the size of the rightmost 
group of 1's in x. This can be done by first forming the exclusive or of 
r and x, which gives 0001 1111 0000 in our example. 


[1] A variation of this algorithm appears in [H&S] sec. 7.6.7. 

This has two too many 1's, and needs to be right-adjusted. This can be 
accomplished by dividing it by s, which right-adjusts it (s is a power 
of 2), and shifting it right two more positions to discard the two 
unwanted bits. The final result is the or of this and r. 


In computer algebra notation, the result is y in 


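Equation 2 

s ← x & -x 
r ← s + x 
y ← r | (((x ⊕ r) >>u 2) ÷u s) 

Here >>u denotes the unsigned (logical) shift right and ÷u denotes 
unsigned division. 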


A complete C procedure is given in Figure 2-1. It executes in seven 
basic RISC instructions, one of which is division. (Do not use this 
procedure with x = 0; that causes division by 0.) 


Figure 2-1 Next higher number with same number of 1-bits. 


unsigned snoob(unsigned x) { 
   unsigned smallest, ripple, ones; 
                                // x = xxx0 1111 0000 
   smallest = x & -x;           //     0000 0001 0000 
   ripple = x + smallest;       //     xxx1 0000 0000 
   ones = x ^ ripple;           //     0001 1111 0000 
   ones = (ones >> 2)/smallest; //     0000 0000 0111 
   return ripple | ones;        //     xxx1 0000 0111 
} 


If division is slow but you have a fast way to compute the number of 
trailing zeros function ntz(x), the number of leading zeros function 
nlz(x), or population count (pop(x) is the number of 1-bits in x), then 
the last line of Equation (2) can be replaced with one of the following: 


y ← r | ((x ⊕ r) >>u (2 + ntz(x))) 
y ← r | ((x ⊕ r) >>u (33 - nlz(s))) 
y ← r | ((1 << (pop(x ⊕ r) - 2)) - 1) 
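
As an illustration of the ntz-based alternative, here is a hedged C 
sketch. The names ntz32, snoob_shift, and count_subsets are 
hypothetical, and ntz32 is a simple portable loop standing in for 
whatever fast trailing-zero count the machine provides. The shift is 
split into two steps so that no single shift amount reaches the word 
size, which C leaves undefined. 

```c
#include <stdint.h>

/* Portable stand-in for ntz(x); requires x != 0. */
static int ntz32(uint32_t x) {
    int n = 0;
    while ((x & 1u) == 0) { x >>= 1; n++; }
    return n;
}

/* Next higher number with the same number of 1-bits, using a shift
   by 2 + ntz(x) in place of the division. Requires x != 0. */
static uint32_t snoob_shift(uint32_t x) {
    uint32_t smallest = x & (0u - x);       /* rightmost 1-bit          */
    uint32_t ripple = x + smallest;         /* carry through the group  */
    uint32_t ones = x ^ ripple;             /* group plus two extra 1's */
    ones = (ones >> 2) >> ntz32(x);         /* right-adjust, drop two   */
    return ripple | ones;
}

/* Count the k-element subsets of an n-element set by stepping through
   them in increasing order; assumes 0 < k <= n < 31. */
static int count_subsets(int n, int k) {
    int count = 0;
    for (uint32_t x = (1u << k) - 1; x < (1u << n); x = snoob_shift(x))
        count++;
    return count;
}
```

The loop in count_subsets is the subset-iteration use described 
earlier: it starts at the smallest word with k 1-bits and stops once 
a bit at or above position n is set. 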


2-2 Addition Combined with Logical Operations 


We assume the reader is familiar with the elementary identities of 
ordinary algebra and Boolean algebra. Below is a selection of similar 
identities involving addition and subtraction combined with logical 
operations: 


a.        -x = ¬x + 1 
b.           = ¬(x - 1) 
c.        ¬x = -x - 1 
d.       -¬x = x + 1 
e.       ¬-x = x - 1 
f.     x + y = x - ¬y - 1 
g.           = (x ⊕ y) + 2(x & y) 
h.           = (x | y) + (x & y) 
i.           = 2(x | y) - (x ⊕ y) 
j.     x - y = x + ¬y + 1 
k.           = (x ⊕ y) - 2(¬x & y) 
l.           = (x & ¬y) - (¬x & y) 
m.           = 2(x & ¬y) - (x ⊕ y) 
n.     x ⊕ y = (x | y) - (x & y) 
o.    x & ¬y = (x | y) - y 
p.           = x - (x & y) 
q.  ¬(x - y) = y - x - 1 
r.           = ¬x + y 
s.     x ≡ y = (x & y) - (x | y) - 1 
t.           = (x & y) + ¬(x | y) 
u.     x | y = (x & ¬y) + y 
v.     x & y = (¬x | y) - ¬x 


Equation (d) may be applied to itself repeatedly, giving -¬-¬x = x + 
2, and so on. Similarly, from (e) we have ¬-¬-x = x - 2. So we can 
add or subtract any constant, using only the two forms of 
complementation. 


Equation (f) is the dual of (j), where (j) is the well-known relation that 
shows how to build a subtracter from an adder. 


Equations (g) and (h) are from HAKMEM memo [HAK, item 23]. 
Equation (g) forms a sum by first computing the sum with carries 
ignored (x ⊕ y) and then adding in the carries. Equation (h) is simply 
modifying the addition operands so that the combination 0 + 1 never 
occurs at any bit position; it is replaced with 1 + 0. 


It can be shown that in the ordinary addition of binary numbers with 
each bit independently equally likely to be 0 or 1, a carry occurs at 
each position with probability about 0.5. However, for an adder built 
by preconditioning the inputs using (g), the probability is about 0.25. 
This observation is probably not of value in building an adder, 
because for that purpose the important characteristic is the maximum 
number of logic circuits the carry must pass through, and using (g) 
reduces the number of stages the carry propagates through by only 
one. 


Equations (k) and (l) are duals of (g) and (h), for subtraction. That is, 
(k) has the interpretation of first forming the difference ignoring the 
borrows (x ⊕ y), and then subtracting the borrows. Similarly, 
Equation (l) is simply modifying the subtraction operands so that the 
combination 1 - 1 never occurs at any bit position; it is replaced with 
0 - 0. 


Equation (n) shows how to implement exclusive or in only three 
instructions on a basic RISC. Using only and-or-not logic requires four 
instructions ((x | y) & ¬(x & y)). Similarly, (u) and (v) show how to 
implement and and or in three other elementary instructions, whereas 
using DeMorgan's laws requires four. 
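
Several of these identities can be spot-checked in C, where unsigned 
arithmetic wraps modulo 2^32 in the same way as the machine adder. The 
sketch below is only a sanity check on sample inputs, not a proof; the 
function name identities_hold is hypothetical. 

```c
#include <stdint.h>

/* Returns nonzero if identities (g), (h), (k), (n), (u), and (v)
   all hold for the given pair of 32-bit words. */
static int identities_hold(uint32_t x, uint32_t y) {
    return x + y == (x ^ y) + 2 * (x & y)      /* (g) */
        && x + y == (x | y) + (x & y)          /* (h) */
        && x - y == (x ^ y) - 2 * (~x & y)     /* (k) */
        && (x ^ y) == (x | y) - (x & y)        /* (n) */
        && (x | y) == (x & ~y) + y             /* (u) */
        && (x & y) == (~x | y) - ~x;           /* (v) */
}
```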


2-3 Inequalities among Logical and Arithmetic Expressions 


Inequalities among binary logical expressions whose values are 
interpreted as unsigned integers are nearly trivial to derive. Here are 
two examples: 


(x ⊕ y) ≤ (x | y), and 


(x & y) ≤ (x ≡ y). 


These can be derived from a list of all binary logical operations, 
shown in Table 2-1. 


Let f(x, y) and g(x, y) represent two columns in Table 2-1. If for each 
row in which f(x, y) is 1, g(x, y) also is 1, then for all (x, y), 
f(x, y) ≤ g(x, y). Clearly, this extends to word-parallel logical 
operations. One can easily read off such relations (most of which are 
trivial) as (x & y) ≤ x ≤ (x | ¬y), and so on. Furthermore, if two 
columns have a row in which one entry is 0 and the other is 1, and 
another row in which the entries are 1 and 0, respectively, then no 
inequality relation exists between the corresponding logical 
expressions. So the question of whether or not f(x, y) ≤ g(x, y) 
is completely and easily solved for all binary logical functions f and g. 


Table 2-1. The 16 Binary Logical Operations 




Use caution when manipulating these relations. For example, for 
ordinary arithmetic, if x + y ≤ a and z ≤ x, then z + y ≤ a. But this 
inference is not valid if "+" is replaced with or. 


Inequalities involving mixed logical and arithmetic expressions are 
more interesting. Below is a small selection. 


a. (x | y) ≥ max(x, y) 
b. (x & y) ≤ min(x, y) 
c. (x | y) ≤ x + y, if the addition does not overflow 
d. (x | y) > x + y, if the addition overflows 
e. |x - y| ≤ (x ⊕ y) 


The proofs of these are quite simple, except possibly for the relation 
|x - y| ≤ (x ⊕ y). By |x - y| we mean the absolute value of x - y, 
which may be computed within the domain of unsigned numbers as 
max(x, y) - min(x, y). This relation may be proven by induction on the 
length of x and y (the proof is a little easier if you extend them on the 
left rather than on the right). 


2-4 Absolute Value Function 


If your machine does not have an instruction for computing the 
absolute value, this computation can usually be done in three or four 
branch-free instructions. First, compute y ← x >>s 31 (a shift right 
signed of 31), and then one of the following: 


abs                  nabs 
(x ⊕ y) - y          y - (x ⊕ y) 
(x + y) ⊕ y          (y - x) ⊕ y 
x - (2x & y)         (2x & y) - x 


By "2x" we mean, of course, x + x or x << 1. 


If you have a fast multiply by a variable whose value is ±1, the 
following will do: 


((x >>s 30) | 1) * x 
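
A C rendering of the first abs and nabs formulas might look as 
follows. This is a sketch under stated assumptions: the function names 
are illustrative, the mask y is built from an unsigned shift and a 
negation because C does not guarantee that >> on a negative int is an 
arithmetic shift, and the final conversion back to int32_t assumes the 
usual two's-complement wraparound. 

```c
#include <stdint.h>

/* y = x >>s 31: all 1's if x is negative, otherwise 0. */
static uint32_t sign_mask(int32_t x) {
    return 0u - ((uint32_t)x >> 31);
}

/* abs(x) = (x ^ y) - y */
static int32_t abs_branchfree(int32_t x) {
    uint32_t ux = (uint32_t)x, y = sign_mask(x);
    return (int32_t)((ux ^ y) - y);
}

/* nabs(x) = y - (x ^ y); never overflows */
static int32_t nabs_branchfree(int32_t x) {
    uint32_t ux = (uint32_t)x, y = sign_mask(x);
    return (int32_t)(y - (ux ^ y));
}
```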


2-5 Sign Extension 


By "sign extension," we mean to consider a certain bit position in a 
word to be the sign bit, and we wish to propagate that to the left, 
ignoring any other bits present. The standard way to do this is with 
shift left logical followed by shift right signed. However, if these 
instructions are slow or nonexistent on your machine, it may be done 
with one of the following, where we illustrate by propagating bit 
position 7 to the left: 


((x + 0x00000080) & 0x000000FF) - 0x00000080 
((x & 0x000000FF) ⊕ 0x00000080) - 0x00000080 


The "+" above can also be "-" or ⊕. The second formula is 
particularly useful if you know that the unwanted high-order bits are 
all 0's, because then the and can be omitted. 
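
The second formula can be written in C as below. This is a small 
sketch with an illustrative function name; the conversion of the 
unsigned result to int32_t assumes two's-complement wraparound. 

```c
#include <stdint.h>

/* Treat bit 7 of x as the sign bit and propagate it to the left,
   ignoring bits 8..31 of the input. */
static int32_t sign_extend_bit7(uint32_t x) {
    return (int32_t)(((x & 0x000000FFu) ^ 0x00000080u) - 0x00000080u);
}
```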


2-6 Shift Right Signed from Unsigned 


If your machine does not have the shift right signed instruction, it may 
be computed using the formulas shown below, in which >>u denotes the 
unsigned (logical) shift right. The first formula is from [GM], and the 
second is based on the same idea. Assuming the machine has mod 64 
shifts, the first four formulas hold for 0 ≤ n ≤ 31, and the last holds 
for 0 ≤ n ≤ 63. The last formula holds for any n if by "holds" we mean 
"treats the shift amount to the same modulus as does the logical shift." 


When n is a variable, each formula requires five or six instructions on 
a basic RISC. 


((x + 0x80000000) >>u n) - (0x80000000 >>u n) 
t ← 0x80000000 >>u n; ((x >>u n) ⊕ t) - t 
t ← (x & 0x80000000) >>u n; (x >>u n) - (t + t) 
(x >>u n) | (-(x >>u 31) << 31 - n) 
t ← -(x >>u 31); ((x ⊕ t) >>u n) ⊕ t 


In the first two formulas, an alternative for the expression 
0x80000000 >>u n is 1 << 31 - n. 

If n is a constant, the first two formulas require only three instructions 
on many machines. If n = 31, the function can be done in two 
instructions with -(x >>u 31). 
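
As a sketch of the second formula in C, where >> on an unsigned 
operand is always a logical shift: the function name is illustrative, 
the shift amount is assumed to lie in 0..31 (C leaves larger shifts 
undefined, unlike the mod 64 shifts assumed above), and the conversion 
back to int32_t assumes two's-complement wraparound. 

```c
#include <stdint.h>

/* Shift right signed built from unsigned shifts:
   t = 0x80000000 >>u n;  result = ((x >>u n) ^ t) - t.
   Requires 0 <= n <= 31. */
static int32_t shift_right_signed(int32_t x, unsigned n) {
    uint32_t t = 0x80000000u >> n;
    return (int32_t)((((uint32_t)x >> n) ^ t) - t);
}
```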


2-7 Sign Function 


The sign, or signum, function is defined by 


            -1,  x < 0, 
sign(x) =    0,  x = 0, 
             1,  x > 0. 


If you don't have shift right signed, then use the substitute noted at 
the end of Section 2-6, giving the following nicely symmetric formula 
(five instructions): 


-(x >>u 31) | (-x >>u 31) 


Comparison predicate instructions permit a three-instruction solution, 
with either 


Equation 3 
(x > 0) - (x < 0), or 
(x ≥ 0) - (x ≤ 0). 


Finally, we note that the formula (-x >>s 31) - (x >>s 31) 
almost works; it fails only for x = -2^31. 


2-8 Three-Valued Compare Function 


The three-valued compare function, a slight generalization of the sign 
function, is defined by 


              -1,  x < y, 
cmp(x, y) =    0,  x = y, 
               1,  x > y. 


There are both signed and unsigned versions, and unless otherwise 
specified, this section applies to both. 


Comparison predicate instructions permit a three-instruction solution, 
an obvious generalization of Equations (3): 


(x > y) - (x < y), or 
(x ≥ y) - (x ≤ y). 
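
In C the comparison operators already yield 0 or 1, so the 
predicate-difference forms of sign and cmp can be written directly. 
The function names below are illustrative only. 

```c
#include <stdint.h>

/* sign(x) = (x > 0) - (x < 0) */
static int sign32(int32_t x) { return (x > 0) - (x < 0); }

/* Signed three-valued compare: (x > y) - (x < y) */
static int cmp_signed(int32_t x, int32_t y) { return (x > y) - (x < y); }

/* Unsigned three-valued compare, same formula on unsigned operands. */
static int cmp_unsigned(uint32_t x, uint32_t y) { return (x > y) - (x < y); }
```

Note that the same bit pattern can compare differently in the two 
versions: 0xFFFFFFFF is greater than 1 when unsigned, but as a signed 
value it is -1 and therefore less than 1. 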


A solution for unsigned integers on PowerPC is shown below [CWG]. 
On this machine, "carry" is "not borrow." 


subf R5,Ry,Rx # R5 <-- Rx - Ry. 

subfc R6,Rx, Ry # R6 <-- Ry - Rx, set carry. 

subfe R7,Ry,Rx # R7 <-- Rx - Ry + carry, set carry. 
subfe R8,R7,R5 # R8 <-- R5 - R7 + carry, (set carry). 


If limited to the instructions of the basic RISC, there does not seem to 
be any particularly good way to compute this function. The 
comparison predicates x < y, x ≤ y, and so on, require about five 
instructions (see Section 2-11), leading to a solution in about 12 
instructions (using a small amount of commonality in computing x < 
y and x > y). On the basic RISC it's probably preferable to use 
compares and branches (six instructions executed worst case if 
compares can be commoned). 


2-9 Transfer of Sign 


The transfer of sign function, called ISIGN in Fortran, is defined by 


                 abs(x),  y ≥ 0, 
ISIGN(x, y) = 
                -abs(x),  y < 0. 


This function can be calculated (modulo 2^32) with four instructions on 
most machines: 


t ← y >>s 31;                       t ← (x ⊕ y) >>s 31; 
ISIGN(x, y) = (abs(x) ⊕ t) - t      ISIGN(x, y) = (x ⊕ t) - t 
            = (abs(x) + t) ⊕ t                  = (x + t) ⊕ t 
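
The variant that avoids abs altogether can be sketched in C as 
follows. The name isign is illustrative; the mask t is formed with an 
unsigned shift and a negation rather than a signed shift, and the 
result is taken modulo 2^32 with the usual two's-complement 
conversion. 

```c
#include <stdint.h>

/* Transfer of sign: abs(x) if y >= 0, -abs(x) if y < 0.
   t = (x ^ y) >>s 31;  result = (x ^ t) - t. */
static int32_t isign(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x;
    uint32_t t = 0u - ((ux ^ (uint32_t)y) >> 31);
    return (int32_t)((ux ^ t) - t);
}
```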


2-10 Decoding a "Zero Means 2**n" Field 


Sometimes a 0 or negative value does not make much sense for a 
quantity, so it is encoded in an n-bit field with a 0 value being 
understood to mean 2^n, and a non-zero value having its normal binary 
interpretation. An example is the length field of PowerPC's load string 
word immediate (lswi) instruction, which occupies five bits. It is not 
useful to have an instruction that loads zero bytes, when the length is 
an immediate quantity, but it is definitely useful to be able to load 32 
bytes. The length field could be encoded with values from 0 to 31 
denoting lengths from 1 to 32, but the "zero means 32" convention 
results in simpler logic when the processor must also support a 
corresponding instruction with a variable (in-register) length that 
employs straight binary encoding (e.g., PowerPC's lswx instruction). 


It is trivial to encode an integer in the range 1 to 2^n into the "zero 
means 2^n" encoding: simply mask the integer with 2^n - 1. To do the 
decoding without a test-and-branch is not quite as simple, but below 
are some possibilities (no doubt overdone), illustrated for a 3-bit field. 
They all require three instructions, not counting possible loads of 
constants. 


((x - 1) & 7) + 1      ((x + 7) | -8) + 9     8 - (-x & 7) 
((x + 7) & 7) + 1      ((x + 7) | 8) - 7      -(-x | -8) 
((x - 1) | -8) + 9     ((x - 1) & 8) + x 
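
A minimal C sketch of the encoding and of the first decoding formula, 
for the 3-bit case. The function names are illustrative, and the 
decoder assumes its argument is already a field value in the range 
0 to 7. 

```c
/* Encode a value in 1..8 into a 3-bit "zero means 8" field. */
static unsigned encode_zero_means_8(unsigned v) { return v & 7u; }

/* Decode a 3-bit field value (0..7): 0 decodes to 8, others to
   themselves, using ((x - 1) & 7) + 1. */
static unsigned decode_zero_means_8(unsigned x) {
    return ((x - 1u) & 7u) + 1u;
}
```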


2-11 Comparison Predicates 


A "comparison predicate" is a function that compares two quantities, 
producing a single bit result of 1 if the comparison is true, and 0 if the 
comparison is false. Below we show branch-free expressions to 
evaluate the result into the sign position. To produce the 1/0 value 
used by some languages (e.g., C), follow the code with a shift right of 
31. To produce the -1/0 result used by some other languages (e.g., 
Basic), follow the code with a shift right signed of 31. 


These formulas are, of course, not of interest on machines such as 
MIPS, the Compaq Alpha, and our model RISC, which have 
comparison instructions that compute many of these predicates 
directly, placing a 0/1-valued result in a general purpose register. 


A machine instruction that computes the negative of the absolute 
value is handy here. We show this function as "nabs." Unlike absolute 
value, it is well defined in that it never overflows. Machines that do 
not have "nabs" but have the more usual "abs" can use -abs(x) for 
nabs(x). If x is the maximum negative number, this overflows twice, 
but the result is correct. (We assume that the absolute value and the 
negation of the maximum negative number is itself.) Because some 
machines have neither "abs" nor "nabs," we give an alternative that 
does not use them. 


The "nlz" function is the number of leading zeros in its argument. The 
"doz" function (difference or zero) is described on page 37. 


x = y:    abs(x - y) - 1 
          abs(x - y + 0x80000000) 
          nlz(x - y) << 26 
          -(nlz(x - y) >>u 5) 
          ¬(x - y | y - x) 

x ≠ y:    nabs(x - y) 
          nlz(x - y) - 32 
          x - y | y - x 

x < y:    (x - y) ⊕ [(x ⊕ y) & ((x - y) ⊕ x)] 
          (x & ¬y) | ((x ≡ y) & (x - y)) 
          nabs(doz(y, x))                          [GSO] 

x ≤ y:    (x | ¬y) & ((x ⊕ y) | ¬(y - x)) 
          ((x ≡ y) >>s 1) + (x & ¬y)               [GSO] 

x <u y:   (¬x & y) | ((x ≡ y) & (x - y)) 
          (¬x & y) | ((¬x | y) & (x - y)) 

x ≤u y:   (¬x | y) & ((x ⊕ y) | ¬(y - x)) 

Here <u and ≤u denote unsigned comparisons, and >>u and >>s denote 
unsigned and signed right shifts. 


For x > y, x ≥ y, and so on, interchange x and y in the formulas for 
x < y, x ≤ y, and so on. The add of 0x80000000 may be replaced with 
any instruction that inverts the high-order bit (in x, y, or x - y). 
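
As an illustration, the "less than" formulas built from and, or, 
equivalence, and subtract can be written in C and followed by a shift 
right of 31 to obtain a 1/0 result. This is a sketch with illustrative 
function names; all arithmetic is done on unsigned words so that the 
subtraction wraps rather than overflowing. 

```c
#include <stdint.h>

/* Signed x < y: sign bit of (x & ~y) | ((x == y bitwise) & (x - y)). */
static unsigned less_than_signed(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t r = (ux & ~uy) | (~(ux ^ uy) & (ux - uy));
    return r >> 31;
}

/* Unsigned x < y: sign bit of (~x & y) | ((x == y bitwise) & (x - y)). */
static unsigned less_than_unsigned(uint32_t x, uint32_t y) {
    uint32_t r = (~x & y) | (~(x ^ y) & (x - y));
    return r >> 31;
}
```

In both functions the first term settles the case where the sign bits 
differ, and the second term uses the sign of x - y, which cannot 
overflow when the sign bits agree. 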


Another class of formulas can be derived from the observation that the 
predicate x < y is given by the sign of x/2 - y/2, and the subtraction 
in that expression cannot overflow. The result can be fixed up by 
subtracting 1 in the cases in which the shifts discard essential 
information, as follows: 


x < y:    (x >>s 1) - (y >>s 1) - (¬x & y & 1) 
x <u y:   (x >>u 1) - (y >>u 1) - (¬x & y & 1) 


These execute in seven instructions on most machines (six if it has and 
not), which is no better than what we have above (five to seven 
instructions, depending upon the fullness of the set of logic 
instructions). 


The formulas above involving "nlz" are due to [Shep], and his formula 
for the x = y predicate is particularly useful because a minor 
variation of it gets the predicate evaluated to a 1/0-valued result with 
only three instructions: 


nlz(x - y) >>u 5. 


Signed comparisons to 0 are frequent enough to deserve special 
mention. Below are some formulas for these, mostly derived directly 
from the above. Again, the result is in the sign position. 


x = 0:    abs(x) - 1 
          abs(x + 0x80000000) 
          nlz(x) << 26 
          -(nlz(x) >>u 5) 
          ¬(x | -x) 
          ¬x & (x - 1) 

x ≠ 0:    nabs(x) 
          nlz(x) - 32 
          x | -x 
          (x >>u 1) - x        [CWG] 

x < 0:    x 

x ≤ 0:    x | (x - 1) 
          x | ¬-x 

x > 0:    x ⊕ nabs(x) 
          (x >>s 1) - x 
          -x & ¬x 

x ≥ 0:    ¬x 


Signed comparisons can be obtained from their unsigned counterparts 
by biasing the signed operands upwards by 2^31 and interpreting the 
results as unsigned integers. The reverse transformation also works. 
Thus we have 


x < y = x + 2^31 <u y + 2^31, 

x <u y = x - 2^31 < y - 2^31. 


Similar relations hold for ≤, ≤u, and so on. Addition and 
subtraction of 2^31 are equivalent, as they amount to inverting the sign 
bit. 


Another way to get signed comparisons from unsigned is based on the 
fact that if x and y have the same sign, then x < y = x <u y, 
whereas if they have opposite signs, then x < y = x >u y [Lamp]. 
Again, the reverse transformation also works, so we have 


x < y = (x <u y) ⊕ x_31 ⊕ y_31, and 

x <u y = (x < y) ⊕ x_31 ⊕ y_31, 


where x_31 and y_31 are the sign bits of x and y, respectively. Similar 
relations hold for ≤, ≤u, and so on. 


Using either of these devices enables computing all the usual 
comparison predicates other than = and ≠ in terms of any one of 
them, with at most three additional instructions on most machines. 
For example, let us take x ≤u y as primitive, because it is one of the 
simplest to implement (it is the carry bit from y - x). Then the other 
predicates can be obtained as follows: 


Comparison Predicates from the Carry Bit 


If the machine can easily deliver the carry bit into a general purpose 
register, this may permit concise code for some of the comparison 
predicates. Below are listed several of these relations. The notation 
carry(expression) means the carry bit generated by the outermost 
operation in expression. We assume the carry bit for the subtraction x 
- y is what comes out of the adder for x + ¬y + 1, which is the 
complement of "borrow." 


x = y:    carry(0 - (x - y)), or carry((x + ¬y) + 1), or 
          carry((x - y - 1) + 1) 
x ≠ y:    carry((x - y) - 1), i.e., carry((x - y) + (-1)) 
x < y:    ¬carry((x + 2^31) - (y + 2^31)) 
x ≤ y:    carry((y + 2^31) - (x + 2^31)) 
x <u y:   ¬carry(x - y) 
x ≤u y:   carry(y - x) 
x = 0:    carry(0 - x), or carry(¬x + 1) 
x ≠ 0:    carry(x - 1), i.e., carry(x + (-1)) 
x < 0:    carry(x + x) 
x ≤ 0:    carry(2^31 - (x + 2^31)) 


For x > y, use the complement of the expression for x ≤ y, and 
similarly for other relations involving "greater than." 


The GNU Superoptimizer has been applied to the problem of 
computing predicate expressions on the IBM RS/6000 computer and 
its close relative PowerPC [GK]. The RS/6000 has instructions for 
abs(x), nabs(x), doz(x, y), and a number of forms of add and subtract 
that use the carry bit. It was found that the RS/6000 can compute all 
the integer predicate expressions with three or fewer elementary (one- 
cycle) instructions, a result that surprised even the architects of the 
machine. "All" includes the six two-operand signed comparisons and 
the four two-operand unsigned comparisons, all of these with the 
second operand being 0, and all in forms that produce a 1/0 result or 
a -1/0 result. PowerPC, which lacks abs(x), nabs(x), and doz(x, y), can 
compute all the predicate expressions in four or fewer elementary 
instructions. 


How the Computer Sets the Comparison Predicates 
Most computers have a way of evaluating the integer comparison 
predicates to a 1-bit result. The result bit may be placed in a 
"condition register" or, for some machines (such as our RISC model), 
in a general purpose register. In either case, the facility is often 
implemented by subtracting the comparison operands and then 
performing a small amount of logic on the result bits to determine the 
1-bit comparison result. 


Below is the logic for these operations. It is assumed that the machine 
computes x - y as x + ¬y + 1, and the following quantities are 
available in the result: 


Co, the carry out of the high-order position 
Ci, the carry into the high-order position 
N, the sign bit of the result 
Z, which equals 1 if the result, exclusive of Co, is all-0, and is 
otherwise 0 


Then we have the following in Boolean algebra notation (juxtaposition 
denotes and, + denotes or): 


V:        Ci ⊕ Co      (signed overflow) 
x = y:    Z 
x ≠ y:    ¬Z 
x < y:    N ⊕ V 
x ≤ y:    (N ⊕ V) + Z 
x > y:    (N ≡ V)¬Z 
x ≥ y:    N ≡ V 
x <u y:   ¬Co 
x ≤u y:   ¬Co + Z 
x >u y:   Co¬Z 
x ≥u y:   Co 


2-12 Overflow Detection 


"Overflow" means that the result of an arithmetic operation is too 
large or too small to be correctly represented in the target register. 
This section discusses methods that a programmer might use to detect 
when overflow has occurred, without using the machine's "status bits" 
that are often supplied expressly for this purpose. This is important 
because some machines do not have such status bits (e.g., MIPS), and 
because even if the machine is so equipped, it is often difficult or 
impossible to access the bits from a high-level language. 


Signed Add/Subtract 


When overflow occurs on integer addition and subtraction, 
contemporary machines invariably discard the high-order bit of the 
result and store the low-order bits that the adder naturally produces. 
Signed integer overflow of addition occurs if and only if the operands 
have the same sign and the sum has sign opposite to that of the 
operands. Surprisingly, this same rule applies even if there is a carry 
into the adder—that is, if the calculation is x + y + 1. This is 
important for the application of adding multiword signed integers, in 
which the last addition is a signed addition of two fullwords and a 
carry-in that may be 0 or +1. 


To prove the rule for addition, let x and y denote the values of the 
one-word signed integers being added, let c (carry-in) be 0 or 1, and 
assume for simplicity a 4-bit machine. Then if the signs of x and y are 
different, 


-8 ≤ x ≤ -1, and 
0 ≤ y ≤ 7, 


or similar bounds apply if x is nonnegative and y is negative. In either 
case, by adding these inequalities and optionally adding in 1 for c, 


-8 ≤ x + y + c ≤ 7. 


This is representable as a 4-bit signed integer, and thus overflow does 
not occur when the operands have opposite signs. 


Now suppose x and y have the same sign. There are two cases: 


(a)                        (b) 
-8 ≤ x ≤ -1                0 ≤ x ≤ 7 
-8 ≤ y ≤ -1                0 ≤ y ≤ 7 


Thus, 


(a)                        (b) 
-16 ≤ x + y + c ≤ -1       0 ≤ x + y + c ≤ 15. 


Overflow occurs if the sum is not representable as a 4-bit signed 
integer, that is, if 


(a)                        (b) 
-16 ≤ x + y + c ≤ -9       8 ≤ x + y + c ≤ 15. 


In case (a), this is equivalent to the high-order bit of the 4-bit sum 
being 0, which is opposite to the sign of x and y. In case (b), this is 
equivalent to the high-order bit of the 4-bit sum being 1, which again 
is opposite to the sign of x and y. 


For subtraction of multiword integers, the computation of interest is x 
- y - c where again c is 0 or 1, with a value of 1 representing a borrow- 
in. From an analysis similar to the above, it can be seen that overflow 
in the final value of x - y - c occurs if and only if x and y have opposite 
signs and the sign of x - y - c is opposite to that of x (or, equivalently, 
the same as that of y). 


This leads to the following expressions for the overflow predicate, 
with the result being in the sign position. Following these with a shift 
right or shift right signed of 31 produces a 1/0- or a -1/0-valued 
result. 


x + y + c                                 x - y - c 
(x ≡ y) & ((x + y + c) ⊕ x)               (x ⊕ y) & ((x - y - c) ⊕ x) 
((x + y + c) ⊕ x) & ((x + y + c) ⊕ y)     ((x - y - c) ⊕ x) & ((x - y - c) ≡ y) 


By choosing the second alternative in the first column, and the first 
alternative in the second column (avoiding the equivalence 
operation), our basic RISC can evaluate these tests with three 
instructions in addition to those required to compute x + y + c or x - 
y - c. A fourth instruction (branch if negative) may be added to branch 
to code where the overflow condition is handled. 
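
For the case c = 0, the two alternatives that avoid the equivalence 
operation can be sketched in C as follows. The function names are 
illustrative; the sum and difference are formed in unsigned arithmetic, 
which wraps modulo 2^32 and so mirrors the low-order bits that the 
adder produces, and the result is shifted right 31 to give a 1/0 
value. 

```c
#include <stdint.h>

/* 1 if the signed addition x + y overflows, else 0:
   sign bit of ((x + y) ^ x) & ((x + y) ^ y). */
static unsigned add_overflows(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t s = ux + uy;
    return ((s ^ ux) & (s ^ uy)) >> 31;
}

/* 1 if the signed subtraction x - y overflows, else 0:
   sign bit of (x ^ y) & ((x - y) ^ x). */
static unsigned sub_overflows(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t d = ux - uy;
    return ((ux ^ uy) & (d ^ ux)) >> 31;
}
```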


If executing with overflow interrupts enabled, the programmer may 
wish to test to see if a certain addition or subtraction will cause 
overflow, in a way that does not cause it. One branch-free way to do 
this is as follows: 


x + y + c                               x - y - c 
z ← (x ≡ y) & 0x80000000                z ← (x ⊕ y) & 0x80000000 
(x ≡ y) & (((x ⊕ z) + y + c) ≡ y)       (x ⊕ y) & (((x ⊕ z) - y - c) ⊕ y) 


The assignment to z in the left column sets z = 0x80000000 if x and y 
have the same sign, and sets z = 0 if they differ. Then, the addition in 
the second expression is done with x and y having different signs, so it 
can't overflow. If x and y are nonnegative, the sign bit in the second 
expression will be 1 if and only if (x - 2^31) + y + c ≥ 0, that is, iff 
x + y + c ≥ 2^31, which is the condition for overflow in evaluating x + 
y + c. If x and y are negative, the sign bit in the second expression 
will be 1 iff (x + 2^31) + y + c < 0, that is, iff x + y + c < -2^31, 
which again is the condition for overflow. The term x ≡ y ensures the 
correct result (0 in the sign position) if x and y have opposite signs. 
Similar remarks apply to the case of subtraction (right column). The 
code executes in nine instructions on the basic RISC. 


It might seem that if the carry from addition is readily available, this 
might help in computing the signed overflow predicate. This does not 
seem to be the case. However, one method along these lines is as 
follows. 


If x is a signed integer, then x + 2^31 is correctly represented as an 
unsigned number, and is obtained by inverting the high-order bit of x. 
Signed overflow in the positive direction occurs if x + y ≥ 2^31, that 
is, if (x + 2^31) + (y + 2^31) ≥ 3·2^31. This latter condition is 
characterized by carry occurring in the unsigned add (which means 
that the sum is greater than or equal to 2^32) and the high-order bit of 
the sum being 1. Similarly, overflow in the negative direction occurs if 
the carry is 0 and the high-order bit of the sum is also 0. 


This gives the following algorithm for detecting overflow for signed 
addition: 


Compute (x ⊕ 2^31) + (y ⊕ 2^31), giving sum s and carry c. 


Overflow occurred iff c equals the high-order bit of s. 


The sum is the correct sum for the signed addition, because inverting 
the high-order bits of both operands does not change their sum. 


For subtraction, the algorithm is the same except that in the first step 
a subtraction replaces the addition. We assume that the carry is that 
generated by computing x - y as x + ¬y + 1. The subtraction is the 
correct difference for the signed subtraction. 


These formulas are perhaps interesting, but on most machines they 
would not be quite as efficient as the formulas that do not even use 
the carry bit (e.g., overflow = (x ≡ y) & (s ⊕ x) for addition, and 
(x ⊕ y) & (d ⊕ x) for subtraction, where s and d are the sum and 
difference, respectively, of x and y). 


How the Computer Sets Overflow for Signed Add/Subtract 


Machines often set "overflow" for signed addition by means of the 
logic "the carry into the sign position is not equal to the carry out of 
the sign position." Curiously, this logic gives the correct overflow 
indication for both addition and subtraction, assuming the subtraction 
x - y is done by x + ¬y + 1. Furthermore, it is correct whether or not 
there is a carry- or borrow-in. This does not seem to lead to any 
particularly good methods for computing the signed overflow 
predicate in software, however, even though it is easy to compute the 
carry into the sign position. For addition and subtraction, the 
carry/borrow into the sign position is given by the sign bit after 
evaluating the following expressions (where c is 0 or 1): 


carry                      borrow 
(x + y + c) ⊕ x ⊕ y        (x - y - c) ⊕ x ⊕ y 


In fact, these expressions give, at each position i, the carry/borrow 
into position i. 


Unsigned Add/Subtract 


The following branch-free code may be used to compute the overflow 
predicate for unsigned add/subtract, with the result being in the sign 
position. The expressions involving a right shift are probably useful 
only when it is known that c = 0. The expressions in brackets 
compute the carry or borrow generated from the least significant 
position. 


x + y + c, unsigned 
(x & y) | ((x | y) & ¬(x + y + c)) 
(x >>u 1) + (y >>u 1) + [((x & y) | ((x | y) & c)) & 1] 


x - y - c, unsigned 
(¬x & y) | ((x ≡ y) & (x - y - c)) 
(¬x & y) | ((¬x | y) & (x - y - c)) 
(x >>u 1) - (y >>u 1) - [((¬x & y) | ((¬x | y) & c)) & 1] 


For unsigned add's and subtract's, there are much simpler formulas in 
terms of comparisons [MIPS]. For unsigned addition, overflow (carry) 
occurs if the sum is less (by unsigned comparison) than either of the 
operands. This and similar formulas are given below. Unfortunately, 
there is no way in these formulas to allow for a variable c that 
represents the carry- or borrow-in. Instead, the program must test c, 
and use a different type of comparison depending upon whether c is 0 
or 1. 


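Operation               Test before the operation       Test after the operation 
$x + y$, unsigned       $\neg x \overset{u}{<} y$        $x + y \overset{u}{<} x$ 
$x + y + 1$, unsigned   $\neg x \overset{u}{\leq} y$     $x + y + 1 \overset{u}{\leq} x$ 
$x - y$, unsigned       $x \overset{u}{<} y$             $x - y \overset{u}{>} x$ 
$x - y - 1$, unsigned   $x \overset{u}{\leq} y$          $x - y - 1 \overset{u}{\geq} x$ 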


The first formula for each case above is evaluated before the add/ 
subtract that may overflow, and it provides a way to do the test 
without causing overflow. The second formula for each case is 
evaluated after the add/subtract that may overflow. 


There does not seem to be a similar simple device (using comparisons) 
for computing the signed overflow predicate. 


Multiplication 


For multiplication, overflow means that the result cannot be expressed 
in 32 bits (it can always be expressed in 64 bits, whether signed or 
unsigned). Checking for overflow is simple if you have access to the 
high-order 32 bits of the product. Let us denote the two halves of the 
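64-bit product by $\mathrm{hi}(x \times y)$ and $\mathrm{lo}(x \times y)$. Then the overflow 
predicates can be computed as follows [MIPS]: 


$x \times y$, unsigned: $\mathrm{hi}(x \times y) \neq 0$ 
$x \times y$, signed:   $\mathrm{hi}(x \times y) \neq (\mathrm{lo}(x \times y) \overset{s}{\gg} 31)$ 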
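Where a 64-bit integer type is available, the two predicates can be sketched in C as follows. This is only an illustration: the function names are made up for the example, and the sign fill of the low half is written with a conditional rather than a signed shift, because right-shifting a negative value is implementation-defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned: overflow iff the high 32 bits of the product are nonzero. */
static bool mulu_overflows(uint32_t x, uint32_t y) {
    uint64_t p = (uint64_t)x * y;
    return (uint32_t)(p >> 32) != 0;
}

/* Signed: overflow iff the high half differs from the sign
   extension of the low half, i.e. hi != (lo >>s 31). */
static bool muls_overflows(int32_t x, int32_t y) {
    int64_t p = (int64_t)x * y;                  /* cannot overflow */
    uint32_t hi = (uint32_t)((uint64_t)p >> 32);
    uint32_t lo = (uint32_t)(uint64_t)p;
    uint32_t sign_fill = (lo & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return hi != sign_fill;
}
```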


One way to check for overflow of multiplication is to do the 
multiplication and then check the result by dividing. But care must be 
taken not to divide by 0, and there is a further complication for signed 
multiplication. Overflow occurs if the following expressions are true: 


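Unsigned: 
$$z \leftarrow x * y$$
$$y \neq 0 \mathbin{\&} z \overset{u}{\div} y \neq x$$

Signed: 
$$z \leftarrow x * y$$
$$(y < 0 \mathbin{\&} x = -2^{31}) \mid (y \neq 0 \mathbin{\&} z \div y \neq x)$$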


The complication arises when $x = -2^{31}$ and $y = -1$. In this case the 
multiplication overflows, but the machine may very well give a result 
of $-2^{31}$. This causes the division to overflow, and thus any result is 
possible (for some machines). Therefore, this case has to be checked 
separately, which is done by the term $y < 0 \mathbin{\&} x = -2^{31}$. The above 
expressions use the "conditional and" operator to prevent dividing by 
0 (in C, use the && operator). 


It is also possible to use division to check for overflow of 
multiplication without doing the multiplication (that is, without 
causing overflow). For unsigned integers, the product overflows iff 
$xy > 2^{32} - 1$, or $x > (2^{32} - 1)/y$, or, since x is an integer, 
$x > \lfloor (2^{32} - 1)/y \rfloor$. Expressed in computer arithmetic, this is 


$$y \neq 0 \mathbin{\&} x \overset{u}{>} (\mathtt{0xFFFFFFFF} \overset{u}{\div} y).$$


For signed integers, the determination of overflow of x * y is not so 
simple. If x and y have the same sign, then overflow occurs iff 
$xy > 2^{31} - 1$. If they have opposite signs, then overflow occurs iff $xy < -2^{31}$. 
These conditions may be tested as indicated in Table 2-2, which 
employs signed division. 


Table 2-2. Overflow Test for Signed Multiplication 


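              $y > 0$                                   $y \leq 0$ 
$x > 0$       $x > \mathtt{0x7FFFFFFF} \div y$          $y < \mathtt{0x80000000} \div x$ 
$x \leq 0$    $x < \mathtt{0x80000000} \div y$          $x \neq 0 \mathbin{\&} y < \mathtt{0x7FFFFFFF} \div x$ 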


This test is awkward to implement because of the four cases. It is 
difficult to unify the expressions very much because of problems with 
overflow and with not being able to represent the number $+2^{31}$. 


The test can be simplified if unsigned division is available. We can use 
the absolute values of x and y, which are correctly represented under 
unsigned integer interpretation. The complete test can then be 
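computed as shown below. The variable $c = 2^{31} - 1$ if x and y have 
the same sign, and $c = 2^{31}$ otherwise. 


$$c \leftarrow ((x \equiv y) \overset{s}{\gg} 31) + 2^{31}$$
$$x \leftarrow \mathrm{abs}(x)$$
$$y \leftarrow \mathrm{abs}(y)$$
$$y \neq 0 \mathbin{\&} x \overset{u}{>} (c \overset{u}{\div} y)$$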


The number of leading zeros instruction may be used to give an 
estimate of whether or not x * y will overflow, and the estimate may 
be refined to give an accurate determination. First, consider the 
multiplication of unsigned numbers. It is easy to show that if x and y, 
as 32-bit quantities, have m and n leading 0's, respectively, then the 
64-bit product has either m + n or m + n + 1 leading 0's (or 64, if 
either x = 0 or y = 0). Overflow occurs if the 64-bit product has 
fewer than 32 leading 0's. Hence, 


$\mathrm{nlz}(x) + \mathrm{nlz}(y) \geq 32$: Multiplication definitely does not overflow. 


$\mathrm{nlz}(x) + \mathrm{nlz}(y) \leq 30$: Multiplication definitely does overflow. 


For $\mathrm{nlz}(x) + \mathrm{nlz}(y) = 31$, overflow may or may not occur. In this 
case, the overflow assessment may be made by evaluating 
$t = x \lfloor y/2 \rfloor$. This will not overflow. Since xy is 2t or, if y is odd, 
2t + x, the product xy overflows if $t \geq 2^{31}$. These considerations lead 
to a plan for computing xy but branching to "overflow" if the product 
overflows. This plan is shown in Figure 2-2. 

Figure 2-2 Determination of overflow of unsigned multiplication. 


unsigned x, y, z, m, n, t; 

m = nlz(x); 
n = nlz(y); 
if (m + n <= 30) goto overflow;   // Definitely overflows. 
t = x*(y >> 1);                   // Cannot overflow here. 
if ((int)t < 0) goto overflow;    // t >= 2**31. 
z = t*2; 
if (y & 1) { 
   z = z + x; 
   if (z < x) goto overflow;      // Carry out of the add. 
} 
// z is the correct product of x and y. 


For the multiplication of signed integers, we can make a partial 
determination of whether or not overflow occurs from the number of 
leading 0's of nonnegative arguments, and the number of leading 1's of 
negative arguments. Let 


$m = \mathrm{nlz}(x) + \mathrm{nlz}(\neg x)$, and 
$n = \mathrm{nlz}(y) + \mathrm{nlz}(\neg y)$. 


Then, we have 


$m + n \geq 34$: Multiplication definitely does not overflow. 


$m + n \leq 31$: Multiplication definitely does overflow. 


There are two ambiguous cases: 32 and 33. The case m + n = 33 
overflows only when both arguments are negative and the true 
product is exactly $2^{31}$ (machine result is $-2^{31}$), so it can be recognized 
by a test that the product has the correct sign (that is, overflow 
occurred if $x \oplus y \oplus (x * y) < 0$). When m + n = 32, the 
distinction is not so easily made. 


We will not dwell on this further, except to note that an overflow 
estimate for signed multiplication can also be made based on 
nlz(abs(x)) + nlz(abs(y)), but again there are two ambiguous cases (a 
sum of 31 or 32). 


Division 


For the signed division $x \div y$, overflow occurs if the following 
expression is true: 


$$y = 0 \mid (x = \mathtt{0x80000000} \mathbin{\&} y = -1)$$


Most machines signal overflow (or trap) for the indeterminate form 
$0 \div 0$. 


Straightforward code for evaluating this expression, including a final 
branch to the overflow handling code, consists of seven instructions, 
three of which are branches. There do not seem to be any particularly 
good tricks to improve on this, but below are a few possibilities: 


$$[\mathrm{abs}(y \oplus \mathtt{0x80000000}) \mid (\mathrm{abs}(x) \mathbin{\&} \mathrm{abs}(y \equiv \mathtt{0x80000000}))] < 0$$


That is, evaluate the large expression in brackets, and branch if the 
result is less than 0. This executes in about nine instructions, counting 
the load of the constant and the final branch, on a machine that has 
the indicated instructions and that gets the "compare to 0" for free. 


Some other possibilities are to first compute z from 


$$z \leftarrow (x \oplus \mathtt{0x80000000}) \mid (y + 1)$$


(three instructions on many machines), and then do the test and 
branch on $y = 0 \mid z = 0$ in one of the following ways: 


$$((y \mid -y) \mathbin{\&} (z \mid -z)) \geq 0$$
$$(\mathrm{nabs}(y) \mathbin{\&} \mathrm{nabs}(z)) \geq 0$$
$$((\mathrm{nlz}(y) \mid \mathrm{nlz}(z)) \overset{u}{\gg} 5) \neq 0$$


These execute in nine, seven, and eight instructions, respectively, on a 
machine that has the indicated instructions. The last line represents a 
good method for PowerPC. 
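The signed-division predicate and the first of the z-based tests can be sketched in C as below. The function names are hypothetical, and the arithmetic is done on unsigned values so that negation and the add of 1 wrap instead of invoking undefined behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Straightforward form: y == 0, or x == -2^31 with y == -1. */
static bool divs_overflows(int32_t x, int32_t y) {
    return y == 0 || (x == INT32_MIN && y == -1);
}

/* Branch-free form: z is 0 exactly when x == 0x80000000 and y == -1,
   and (v | -v) has its sign bit set exactly when v != 0. */
static bool divs_overflows_branchfree(int32_t x, int32_t y) {
    uint32_t uy = (uint32_t)y;
    uint32_t z = ((uint32_t)x ^ 0x80000000u) | (uy + 1u);
    uint32_t both_nonzero = (uy | (0u - uy)) & (z | (0u - z));
    return (both_nonzero >> 31) == 0;   /* sign bit clear: y == 0 or z == 0 */
}
```

A caller would test the predicate before dividing, for example `if (!divs_overflows(x, y)) q = x / y;`.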


For the unsigned division $x \overset{u}{\div} y$, overflow occurs if and only if y = 0. 


2-13 Condition Code Result of Add, Subtract, and Multiply 


Many machines provide a "condition code" that characterizes the 
result of integer arithmetic operations. Often there is only one add 
instruction, and the characterization reflects the result for both 
unsigned and signed interpretation of the operands and result (but not 
for mixed types). The characterization usually consists of the 
following: 


Whether or not carry occurred (unsigned overflow) 
Whether or not signed overflow occurred 


Whether the 32-bit result, interpreted as a signed two's- 
complement integer and ignoring carry and overflow, is 
negative, 0, or positive 


Some older machines give an indication of whether the infinite 
precision result (that is, 33-bit result for add's and subtract's) is 
positive, negative, or 0. However, this indication is not easily used by 
compilers of high-level languages, and so has fallen out of favor. 


For addition, only nine of the 12 combinations of these events are 
possible. The ones that cannot occur are "no carry, overflow, result > 
0," "no carry, overflow, result = 0," and "carry, overflow, result < 0." 
Thus, four bits are, just barely, needed for the condition code. Two of 
the combinations are unique in the sense that only one value of inputs 
produces them: Adding 0 to itself is the only way to get "no carry, no 
overflow, result = 0," and adding the maximum negative number to 
itself is the only way to get "carry, overflow, result = 0." These 
remarks remain true if there is a "carry in"—that is, if we are 
computing x + y + 1. 


For subtraction, let us assume that to compute x - y the machine 
actually computes $x + \neg y + 1$, with the carry produced as for an add 
(in this scheme the meaning of "carry" is reversed for subtraction, in 
that carry = 1 signifies that the result fits in a single word, and carry 
= 0 signifies that the result does not fit in a single word). Then for 
subtraction only seven combinations of events are possible. The ones 
that cannot occur are the three that cannot occur for addition, plus 
"no carry, no overflow, result = 0," and "carry, overflow, result = 0." 


If a machine's multiplier can produce a doubleword result, then two 
multiply instructions are desirable: one for signed and one for 
unsigned operands. (On a 4-bit machine, in hexadecimal, F × F = 01 
signed, and F × F = E1 unsigned.) For these instructions, neither carry 
nor overflow can occur, in the sense that the result will always fit in a 
doubleword. 


For a multiplication instruction that produces a one-word result (the 
low-order word of the doubleword result), let us take "carry" to mean 
that the result does not fit in a word with the operands and result 
interpreted as unsigned integers, and let us take "overflow" to mean 
that the result does not fit in a word with the operands and result 
interpreted as signed two's-complement integers. Then again, there are 
nine possible combinations of results, with the missing ones being "no 
carry, overflow, result > 0," "no carry, overflow, result = 0," and 
"carry, no overflow, result = 0." Thus, considering addition, 
subtraction, and multiplication together, ten combinations can occur. 


2-14 Rotate Shifts 


These are rather trivial. Perhaps surprisingly, this code works for n 
ranging from 0 to 32 inclusive, even if the shifts are mod-32. 


Rotate left n:  $y \leftarrow (x \ll n) \mid (x \overset{u}{\gg} (32 - n))$ 


Rotate right n: $y \leftarrow (x \overset{u}{\gg} n) \mid (x \ll (32 - n))$ 
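In C the same idea needs one adjustment, because shifting a 32-bit value by 32 is undefined in the language even though many machines handle it. The sketch below (function names are illustrative) reduces both shift counts modulo 32, which keeps the result correct for every n, including 0 and 32.

```c
#include <stdint.h>

/* Rotate x left by n bit positions; n is reduced modulo 32. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Rotate x right by n bit positions; n is reduced modulo 32. */
static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31;
    return (x >> n) | (x << ((32 - n) & 31));
}
```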


2-15 Double-Length Add/Subtract 


Using one of the expressions shown on page 29 for overflow of 
unsigned addition and subtraction, we can easily implement double- 
length addition and subtraction without accessing the machine's carry 
bit. To illustrate with double-length addition, let the operands be 
$(x_1, x_0)$ and $(y_1, y_0)$, and the result be $(z_1, z_0)$. Subscript 1 denotes the 
most significant half, and subscript 0 the least significant. We assume 
that all 32 bits of the registers are used. The less significant words are 
unsigned quantities. 


$$z_0 \leftarrow x_0 + y_0$$
$$c \leftarrow [(x_0 \mathbin{\&} y_0) \mid ((x_0 \mid y_0) \mathbin{\&} \neg z_0)] \overset{u}{\gg} 31$$
$$z_1 \leftarrow x_1 + y_1 + c$$


This executes in nine instructions. The second line can be 
$c \leftarrow (z_0 \overset{u}{<} x_0)$, permitting a four-instruction solution on 
machines that have this comparison operator in a form that gives the 
result as a 1 or 0 in a register, such as the "SLTU" (Set on Less Than 
Unsigned) instruction on MIPS [MIPS]. 


Similar code for double-length subtraction (x - y) is 


$$z_0 \leftarrow x_0 - y_0$$
$$b \leftarrow [(\neg x_0 \mathbin{\&} y_0) \mid ((x_0 \equiv y_0) \mathbin{\&} z_0)] \overset{u}{\gg} 31$$
$$z_1 \leftarrow x_1 - y_1 - b$$


This executes in eight instructions on a machine that has a full set of 
logical instructions. The second line can be $b \leftarrow (x_0 \overset{u}{<} y_0)$, 
permitting a four-instruction solution on machines that have the 
"SLTU" instruction. 


Double-length addition and subtraction can be done in five 
instructions on most machines by representing the multiple-length 
data using only 31 bits of the least significant words, with the high- 
order bit being 0 except momentarily when it contains a carry or 
borrow bit. 
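The comparison form of the carry and borrow maps directly onto C, where an unsigned comparison yields 0 or 1. The sketch below is illustrative only; the struct u64pair and the function names dadd and dsub are made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper type: a 64-bit value held as two 32-bit halves. */
typedef struct { uint32_t hi, lo; } u64pair;

/* Double-length add: the carry is (z0 <u x0). */
static u64pair dadd(u64pair x, u64pair y) {
    u64pair z;
    z.lo = x.lo + y.lo;
    uint32_t c = z.lo < x.lo;       /* 1 if the low-order add wrapped */
    z.hi = x.hi + y.hi + c;
    return z;
}

/* Double-length subtract: the borrow is (x0 <u y0). */
static u64pair dsub(u64pair x, u64pair y) {
    u64pair z;
    z.lo = x.lo - y.lo;
    uint32_t b = x.lo < y.lo;       /* 1 if the low-order subtract borrowed */
    z.hi = x.hi - y.hi - b;
    return z;
}
```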


2-16 Double-Length Shifts 


Let $(x_1, x_0)$ be a pair of 32-bit words to be shifted left or right as if 
they were a single 64-bit quantity, with $x_1$ being the most significant 
half. Let $(y_1, y_0)$ be the result, interpreted similarly. Assume the shift 
amount n is a variable ranging from 0 to 63. Assume further that the 
machine's shift instructions are modulo 64 or greater. That is, a shift 
amount in the range 32 to 63 or -32 to -1 results in an all-0 word, 
unless the shift is a signed right shift, in which case the result is 32 
sign bits from the word shifted. (This code will not work on the Intel 
x86 machines, which have mod-32 shifts.) 


Under these assumptions the shift left double operation may be 
accomplished as follows (eight instructions): 


$$y_1 \leftarrow x_1 \ll n \mid x_0 \overset{u}{\gg} (32 - n) \mid x_0 \ll (n - 32)$$
$$y_0 \leftarrow x_0 \ll n$$


The main connective in the first assignment must be or, not plus, to 
give the correct result when n = 32. If it is known that $0 \leq n \leq 32$, 
the last term of the first assignment may be omitted, giving a six- 
instruction solution. 


Similarly, a shift right double unsigned operation may be done with 


$$y_0 \leftarrow x_0 \overset{u}{\gg} n \mid x_1 \ll (32 - n) \mid x_1 \overset{u}{\gg} (n - 32)$$
$$y_1 \leftarrow x_1 \overset{u}{\gg} n$$


Shift right double signed is more difficult, because of an unwanted 
sign propagation in one of the terms. Straightforward code follows: 


if $n < 32$ then $y_0 \leftarrow x_0 \overset{u}{\gg} n \mid x_1 \ll (32 - n)$ 
else $y_0 \leftarrow x_1 \overset{s}{\gg} (n - 32)$ 
$y_1 \leftarrow x_1 \overset{s}{\gg} n$ 


If your machine has the conditional move instructions, it is a simple 
matter to express this in branch-free code, in which form it takes eight 
instructions. If the conditional move instructions are not available, the 
operation may be done in ten instructions by using the familiar device 
of constructing a mask with the shift right signed 31 instruction to 
mask the unwanted sign propagating term: 


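$$y_0 \leftarrow x_0 \overset{u}{\gg} n \mid x_1 \ll (32 - n) \mid [(x_1 \overset{s}{\gg} (n - 32)) \mathbin{\&} ((32 - n) \overset{s}{\gg} 31)]$$
$$y_1 \leftarrow x_1 \overset{s}{\gg} n$$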


2-17 Multibyte Add, Subtract, Absolute Value 


Some applications deal with arrays of short integers (usually bytes or 
halfwords), and often execution is faster if they are operated on a 
word at a time. For definiteness, the examples here deal with the case 
of four 1-byte integers packed into a word, but the techniques are 
easily adapted to other packings, such as a word containing a 12-bit 
integer and two 10-bit integers, and so on. These techniques are of 
greater value on 64-bit machines, because more work is done in 
parallel. 


Addition must be done in a way that blocks the carries from one byte 
into another. This can be accomplished by the following two-step 
method: 


1. Mask out the high-order bit of each byte of each operand and 
add (there will then be no carries across byte boundaries). 


2. Fix up the high-order bit of each byte with a 1-bit add of the 
two operands and the carry into that bit. 


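The carry into the high-order bit of each byte is of course given by the 
high-order bit of each byte of the sum computed in step 1. A similar 
method works for subtraction: 


Addition 
$$s \leftarrow (x \mathbin{\&} \mathtt{0x7F7F7F7F}) + (y \mathbin{\&} \mathtt{0x7F7F7F7F})$$
$$s \leftarrow ((x \oplus y) \mathbin{\&} \mathtt{0x80808080}) \oplus s$$


Subtraction 
$$d \leftarrow (x \mid \mathtt{0x80808080}) - (y \mathbin{\&} \mathtt{0x7F7F7F7F})$$
$$d \leftarrow ((x \oplus y) \mid \mathtt{0x7F7F7F7F}) \equiv d$$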


These execute in eight instructions, counting the load of 0x7F7F7F7F, 
on a machine that has a full set of logical instructions. (Change the 
"and" and "or" of 0x80808080 to "and not" and "or not," respectively, of 
0x7F7F7F7F.) 
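A minimal C rendering of the two-step method for four packed bytes might look like this. The function names add_bytes and sub_bytes are illustrative, and the equivalence operation is written as the complement of exclusive or.

```c
#include <stdint.h>

/* Add four packed bytes independently (each byte wraps modulo 256). */
static uint32_t add_bytes(uint32_t x, uint32_t y) {
    /* Step 1: add the low 7 bits of each byte; no carry crosses a byte. */
    uint32_t s = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    /* Step 2: 1-bit add of the two high bits and the carry into them. */
    return ((x ^ y) & 0x80808080u) ^ s;
}

/* Subtract four packed bytes independently (each byte wraps modulo 256). */
static uint32_t sub_bytes(uint32_t x, uint32_t y) {
    /* Forcing each high bit of x to 1 absorbs any borrow within the byte. */
    uint32_t d = (x | 0x80808080u) - (y & 0x7F7F7F7Fu);
    /* Equivalence with ((x ^ y) | 0x7F7F7F7F) fixes up the high bits. */
    return ~(((x ^ y) | 0x7F7F7F7Fu) ^ d);
}
```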


There is a different technique for the case in which the word is 
divided into only two fields. In this case, addition can be done by 
means of a 32-bit addition followed by subtracting out the unwanted 
carry. On page 28 we noted that the expression $(x + y) \oplus x \oplus y$ 
gives the carries into each position. Using this and similar 
observations about subtraction gives the following code for adding/ 
subtracting two halfwords modulo $2^{16}$ (seven instructions): 


Addition                                                        Subtraction 
$s \leftarrow x + y$                                             $d \leftarrow x - y$ 
$c \leftarrow (s \oplus x \oplus y) \mathbin{\&} \mathtt{0x00010000}$    $b \leftarrow (d \oplus x \oplus y) \mathbin{\&} \mathtt{0x00010000}$ 
$s \leftarrow s - c$                                             $d \leftarrow d + b$ 


Multibyte absolute value is easily done by complementing and adding 
1 to each byte that contains a negative integer (that is, has its high- 
order bit on). The following code sets each byte of y equal to the 
absolute value of each byte of x (eight instructions): 


$a \leftarrow x \mathbin{\&} \mathtt{0x80808080}$     // Isolate signs. 
$b \leftarrow a \overset{u}{\gg} 7$                    // Integer 1 where x is negative. 
$m \leftarrow (a - b) \mid a$                          // 0xFF where x is negative. 
$y \leftarrow (x \oplus m) + b$                        // Complement and add 1 where negative. 


The third line could as well be $m \leftarrow a + a - b$. The addition of b in 
the fourth line cannot carry across byte boundaries, because the 
quantity $x \oplus m$ has a high-order 0 in each byte. 
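The four steps translate line for line into C. The function name abs_bytes is an assumption for this sketch; note that a byte holding -128 (0x80) is left as 0x80, as with any two's-complement absolute value.

```c
#include <stdint.h>

/* Absolute value of each of four signed bytes packed in x. */
static uint32_t abs_bytes(uint32_t x) {
    uint32_t a = x & 0x80808080u;   /* isolate the sign bit of each byte */
    uint32_t b = a >> 7;            /* integer 1 in each negative byte */
    uint32_t m = (a - b) | a;       /* 0xFF in each negative byte */
    return (x ^ m) + b;             /* complement and add 1 where negative */
}
```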


2-18 Doz, Max, Min 


The "doz" function is "difference or zero," defined as follows, for 
signed arguments: 


$$\mathrm{doz}(x, y) = \begin{cases} x - y, & x \geq y, \\ 0, & x < y. \end{cases}$$


It has been called "first grade subtraction," because the result is 0 if 
you try to take away too much. We will use it to implement max(x, y) 
and min(x, y). In this connection it is important to note that doz(x, y) 
can be negative; it is negative if the subtraction overflows. The 
difference or zero function can be used directly to implement the 
Fortran IDIM function, although in Fortran, results are generally 
undefined if overflow occurs. 


There seems to be no very good way to implement doz(x, y), max(x, 
y), and min(x, y) in a branch-free way that is applicable to most 
computers. About the best we can do is to compute doz(x, y) using 
one of the expressions given on page 22 for the x < y predicate, and 
then compute max(x, y) and min(x, y) from it, as follows: 


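$$d \leftarrow x - y$$
$$\mathrm{doz}(x, y) = d \mathbin{\&} [(d \equiv ((x \oplus y) \mathbin{\&} (d \oplus x))) \overset{s}{\gg} 31]$$
$$\max(x, y) = y + \mathrm{doz}(x, y)$$
$$\min(x, y) = x - \mathrm{doz}(x, y)$$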


This computes doz(x, y) in seven instructions if the machine has 
equivalence, or eight if not, and it computes max(x, y) or min(x, y) in 
one more instruction. 


The following are unsigned versions of these functions: 


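$$d \leftarrow x - y$$
$$\mathrm{dozu}(x, y) = d \mathbin{\&} \neg[((\neg x \mathbin{\&} y) \mid ((x \equiv y) \mathbin{\&} d)) \overset{s}{\gg} 31]$$
$$\mathrm{maxu}(x, y) = y + \mathrm{dozu}(x, y)$$
$$\mathrm{minu}(x, y) = x - \mathrm{dozu}(x, y)$$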
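A branch-free C sketch of these functions follows. It works on unsigned values so that the subtraction wraps, and it builds the all-1's mask by negating a 0-or-1 bit instead of using a signed right shift; both choices are adaptations for portable C, and the names max_s and min_s are illustrative. The final conversions back to int32_t assume the usual two's-complement behavior.

```c
#include <stdint.h>

/* Signed difference or zero. */
static int32_t doz(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t d = ux - uy;
    /* Sign bit of t is 1 exactly when x >= y (signed comparison). */
    uint32_t t = ~(d ^ ((ux ^ uy) & (d ^ ux)));
    uint32_t mask = 0u - (t >> 31);          /* all 1's if x >= y, else 0 */
    return (int32_t)(d & mask);
}

static int32_t max_s(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)y + (uint32_t)doz(x, y));
}

static int32_t min_s(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x - (uint32_t)doz(x, y));
}

/* Unsigned difference or zero. */
static uint32_t dozu(uint32_t x, uint32_t y) {
    uint32_t d = x - y;
    uint32_t borrow = ((~x & y) | (~(x ^ y) & d)) >> 31;   /* 1 if x <u y */
    return d & (borrow - 1u);                /* mask is all 1's when no borrow */
}
```

As the text notes, doz can return a negative value when the subtraction overflows (for example doz(INT32_MAX, INT32_MIN) is -1), yet max_s and min_s still come out right.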


The IBM RISC/6000 computer, and its predecessor the 801, has doz(x, 
y) provided as a single instruction. It permits computing the max(x, y) 
and min(x, y) of signed integers in two instructions, and is 
occasionally useful in itself. Implementing max(x, y) and min(x, y) 
directly is more costly because the machine would then need paths 
from the output ports of the register file back to an input port, 
bypassing the ALU. 


Machines that have conditional move can get destructive [2] max(x, y) 
and min(x, y) in two instructions. For example, on our full RISC, x = 
max(x, y) can be calculated as follows (we write the target register 
first): 


[2] A destructive operation is one that overwrites one or more of its 
arguments. 


cmplt z,x,y      Set z = 1 if x < y, else 0. 


movne x,z,y      If z is nonzero, set x = y. 


2-19 Exchanging Registers 


A very old trick is that of exchanging the contents of two registers 
without using a third [IBM]: 


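$$x \leftarrow x \oplus y$$
$$y \leftarrow y \oplus x$$
$$x \leftarrow x \oplus y$$


This works well on a two-address machine. The trick also works if $\oplus$ 
is replaced by the $\equiv$ logical operation (complement of exclusive or), 
and can be made to work in various ways with add's and subtract's: 


$x \leftarrow x + y$      $x \leftarrow x - y$      $x \leftarrow y - x$ 
$y \leftarrow x - y$      $y \leftarrow y + x$      $y \leftarrow y - x$ 
$x \leftarrow x - y$      $x \leftarrow y - x$      $x \leftarrow x + y$ 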


Unfortunately, each of these has an instruction that is unsuitable for a 
two-address machine, unless the machine has "reverse subtract." 


This little trick can actually be useful in the application of double 
buffering, in which two pointers are swapped. The first instruction can 
be factored out of the loop in which the swap is done (although this 
negates the advantage of saving a register): 


Outside the loop: $t \leftarrow x \oplus y$ 
Inside the loop:  $x \leftarrow x \oplus t$ 
                  $y \leftarrow y \oplus t$ 


Exchanging Corresponding Fields of Registers 


The problem here is to exchange the contents of two registers x and y 
wherever a mask bit $m_i = 1$, and to leave x and y unaltered wherever 
$m_i = 0$. By "corresponding" fields, we mean that no shifting is 
required. The 1-bits of m need not be contiguous. The straightforward 
method is as follows: 


$$x' \leftarrow (x \mathbin{\&} \neg m) \mid (y \mathbin{\&} m)$$
$$y \leftarrow (y \mathbin{\&} \neg m) \mid (x \mathbin{\&} m)$$
$$x \leftarrow x'$$


By using "temporaries" for the four and expressions, this can be seen to 
require seven instructions, assuming that either m or $\neg m$ can be loaded 
with a single instruction and the machine has and not as a single 
instruction. If the machine is capable of executing the four 
(independent) and expressions in parallel, the execution time is only 
three cycles. 


A method that is probably better (five instructions, but four cycles on 
a machine with unlimited instruction-level parallelism) is shown in 
column (a) below. It is suggested by the "three exclusive or" code for 
exchanging registers. 


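(a)                                        (b)                                          (c) 
$x \leftarrow x \oplus y$                   $x \leftarrow x \equiv y$                     $t \leftarrow (x \oplus y) \mathbin{\&} m$ 
$y \leftarrow y \oplus (x \mathbin{\&} m)$  $y \leftarrow y \equiv (x \mid \neg m)$       $x \leftarrow x \oplus t$ 
$x \leftarrow x \oplus y$                   $x \leftarrow x \equiv y$                     $y \leftarrow y \oplus t$ 


The steps in column (b) do the same exchange as that of column (a), 
but column (b) is useful if m does not fit in an immediate field but $\neg m$ 
does, and the machine has the equivalence instruction. 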


Still another method is shown in column (c) above [GLS1]. It also 
takes five instructions (again assuming one instruction must be used to 
load m into a register), but executes in only three cycles on a machine 
with sufficient instruction-level parallelism. 
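Columns (a) and (c) can be sketched in C as follows. The function names are illustrative, and both take pointers because C has no notion of swapping registers in place. Column (a) assumes x and y are distinct objects; if both pointers refer to the same word, its first step would zero it.

```c
#include <stdint.h>

/* Column (a): three exclusive-ors, with the mask applied in the middle step.
   Requires x != y (distinct objects). */
static void swap_fields_a(uint32_t *x, uint32_t *y, uint32_t m) {
    *x ^= *y;
    *y ^= *x & m;
    *x ^= *y;
}

/* Column (c): compute the masked difference once, then apply it to both. */
static void swap_fields_c(uint32_t *x, uint32_t *y, uint32_t m) {
    uint32_t t = (*x ^ *y) & m;
    *x ^= t;
    *y ^= t;
}
```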


Exchanging Two Fields of the Same Register 


Assume a register x has two fields (of the same length) that are to be 
swapped, without altering other bits in the register. That is, the object 
is to swap fields B and D, without altering fields A, C, and E, in the 
computer word illustrated below. The fields are separated by a shift 
distance k. 


[Figure: a computer word divided into fields A, B, C, D, and E, from 
most significant to least significant, with B and D separated by k bits.] 


Straightforward code would shift D and B to their new positions, and 
combine the words with and and or operations, as follows: 


$$t_1 = (x \mathbin{\&} m) \ll k$$
$$t_2 = (x \overset{u}{\gg} k) \mathbin{\&} m$$
$$x = (x \mathbin{\&} m') \mid t_1 \mid t_2$$


Here, m is a mask with 1's in field D (and 0's elsewhere), and m' is a 
mask with 1's in fields A, C, and E. This code requires nine 
instructions and four cycles on a machine with unlimited instruction- 
level parallelism, allowing for two instructions to load the two masks. 


A method that requires only seven instructions and executes in five 
cycles, under the same assumptions, is shown below [GLS1]. It is 
similar to the code in column (c) on page 39 for interchanging 
corresponding fields of two registers. Again, m is a mask that isolates 
field D. 


$$t_1 = [x \oplus (x \overset{u}{\gg} k)] \mathbin{\&} m$$
$$t_2 = t_1 \ll k$$
$$x = x \oplus t_1 \oplus t_2$$


The idea is that $t_1$ contains $B \oplus D$ in position D (and 0's elsewhere), 
and $t_2$ contains $B \oplus D$ in position B. This code, and the 
straightforward code given earlier, work correctly if B and D are "split 
fields" (that is, if the 1-bits of mask m are not contiguous). 
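The seven-instruction method is short enough to try directly in C. In this sketch the function name is made up, m is assumed to have 1's only in the lower field D, and k is assumed to satisfy 0 < k < 32 so that the shifts are well defined.

```c
#include <stdint.h>

/* Swap field D (selected by m) with the field k bits to its left. */
static uint32_t swap_fields_in_word(uint32_t x, uint32_t m, unsigned k) {
    uint32_t t1 = (x ^ (x >> k)) & m;   /* B xor D, in position D */
    uint32_t t2 = t1 << k;              /* B xor D, in position B */
    return x ^ t1 ^ t2;
}
```

For example, with m = 0x000000FF and k = 16 the low byte and the byte at bits 16 to 23 trade places, so 0xA1B2C3D4 becomes 0xA1D4C3B2.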

Conditional Exchange 


The exchange methods of the preceding two sections, which are based 
on exclusive or, degenerate into no-operations if the mask m is 0. 
Hence, they can perform an exchange of entire registers, or of 
corresponding fields of two registers, or of two fields of the same 
register, if m is set to all 1's if some condition c is true, and to all 0's if 
c is false. This gives branch-free code if m can be set up without 
branching. 


2-20 Alternating among Two or More Values 


Suppose a variable x can have only two possible values a and b, and 
you wish to assign to x the value other than its current one, and you 
wish your code to be independent of the values of a and b. For 
example, in a compiler x might be an opcode that is known to be 
either branch true or branch false, and whichever it is, you want to 
switch it to the other. The values of the opcodes branch true and 
branch false are arbitrary, probably defined by a C #define or 
enum declaration in a header file. 


The straightforward code to do the switch is 

if (x == a) x = b; 

else x = a; 

or, as is often seen in C programs, 

x = x == a ? b : a; 

A far better (or at least more efficient) way to code it is either 

$x \leftarrow a + b - x$, or 
$x \leftarrow a \oplus b \oplus x$. 


If a and b are constants, these require only one or two basic RISC 
instructions. Of course, overflow in calculating a + b can be ignored. 


This raises the question: Is there some particularly efficient way to 
cycle among three or more values? That is, given three arbitrary but 
distinct constants a, b, and c, we seek an easy-to-evaluate function f 
that satisfies 


f(a) = b, 
f(b) = c, and 
f(c) = a. 


It is perhaps interesting to note that there is always a polynomial for 
such a function. For the case of three constants, 


Equation 2 


$$f(x) = \frac{(x-a)(x-b)}{(c-a)(c-b)}\,a + \frac{(x-b)(x-c)}{(a-b)(a-c)}\,b + \frac{(x-c)(x-a)}{(b-c)(b-a)}\,c$$


(The idea is that if x = a, the first and last terms vanish, and the 
middle term simplifies to b, and so on.) This requires 14 arithmetic 
operations to evaluate, and, for arbitrary a, b, and c, the intermediate 
results exceed the computer's word size. But it is just a quadratic; if 
written in the usual form for a polynomial and evaluated using 
Horner's rule, [3] it would require only five arithmetic operations (four 
for a quadratic with integer coefficients, plus one for a final division). 
Rearranging Equation (2) accordingly gives 


[3] Horner's rule simply factors out x. For example, it evaluates the 
fourth-degree polynomial $ax^4 + bx^3 + cx^2 + dx + e$ as 
$x(x(x(ax + b) + c) + d) + e$. For a polynomial of degree n it takes n 
multiplications and n additions, and it is very suitable for the 
multiply-add instruction. 


$$f(x) = \frac{1}{(a-b)(a-c)(b-c)} \Big\{ [(a-b)a + (b-c)b + (c-a)c]\,x^2 + [(a-b)b^2 + (b-c)c^2 + (c-a)a^2]\,x + [(a-b)a^2 b + (b-c)b^2 c + (c-a)ac^2] \Big\}.$$


This is getting too complicated to be interesting (or practical). 


Another method, similar to Equation (2) in that just one of the three 
terms survives, is 


$$f(x) = ((-(x = c)) \mathbin{\&} a) + ((-(x = a)) \mathbin{\&} b) + ((-(x = b)) \mathbin{\&} c).$$


This takes 11 instructions if the machine has the equal predicate, not 
counting loads of constants. Because the two addition operations are 
combining two 0 values with a nonzero, they can be replaced with or 
or exclusive or operations. 


The formula can be simplified by precalculating a - c and b - c, and 
then using [GLS1]: 


$$f(x) = ((-(x = c)) \mathbin{\&} (a - c)) + ((-(x = a)) \mathbin{\&} (b - c)) + c, \text{ or}$$
$$f(x) = ((-(x = c)) \mathbin{\&} (a \oplus c)) \oplus ((-(x = a)) \mathbin{\&} (b \oplus c)) \oplus c.$$


Each of these operations takes eight instructions. But on most 
machines these are probably no better than the straightforward C code 
shown below, which executes in four to six instructions for small a, 
b, and c. 


if (x == a) x = b; 
else if (x == b) x = c; 
else x = a; 


Pursuing this matter, there is an ingenious branch-free method of 
cycling among three values on machines that do not have comparison 
predicate instructions [GLS1]. It executes in eight instructions on most 
machines. 


Because a, b, and c are distinct, there are two bit positions, $n_1$ and $n_2$, 
where the bits of a, b, and c are not all the same, and where the "odd 
one out" (the one whose bit differs in that position from the other two) 
is different in positions $n_1$ and $n_2$. This is illustrated below for the 
values 21, 31, and 20, shown in binary. 


    1 0 1 0 1   c 
    1 1 1 1 1   a 
    1 0 1 0 0   b 
      ^     ^ 
      n1    n2 


Without loss of generality, rename a, b, and c so that a has the odd 
one out in position $n_1$ and b has the odd one out in position $n_2$, as 
shown above. Then there are two possibilities for the values of the bits 
at position $n_1$, namely 
$(a_{n_1}, b_{n_1}, c_{n_1}) = (0, 1, 1)$ or $(1, 0, 0)$. Similarly, there are 
two possibilities for the bits at position $n_2$, namely 
$(a_{n_2}, b_{n_2}, c_{n_2}) = (0, 1, 0)$ or $(1, 0, 1)$. This makes four 
cases in all, and formulas for each of these cases are shown below: 

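Case 1. $(a_{n_1}, b_{n_1}, c_{n_1}) = (0, 1, 1)$, $(a_{n_2}, b_{n_2}, c_{n_2}) = (0, 1, 0)$: 
$$f(x) = x_{n_1} * (a - b) + x_{n_2} * (c - a) + b$$

Case 2. $(a_{n_1}, b_{n_1}, c_{n_1}) = (0, 1, 1)$, $(a_{n_2}, b_{n_2}, c_{n_2}) = (1, 0, 1)$: 
$$f(x) = x_{n_1} * (a - b) + x_{n_2} * (a - c) + (b + c - a)$$

Case 3. $(a_{n_1}, b_{n_1}, c_{n_1}) = (1, 0, 0)$, $(a_{n_2}, b_{n_2}, c_{n_2}) = (0, 1, 0)$: 
$$f(x) = x_{n_1} * (b - a) + x_{n_2} * (c - a) + a$$

Case 4. $(a_{n_1}, b_{n_1}, c_{n_1}) = (1, 0, 0)$, $(a_{n_2}, b_{n_2}, c_{n_2}) = (1, 0, 1)$: 
$$f(x) = x_{n_1} * (b - a) + x_{n_2} * (a - c) + c$$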


In these formulas, the left operand of each multiplication is a single 
bit. A multiplication by 0 or 1 may be converted into an and with a 
value of 0 or all 1's. Thus, the formulas can be rewritten as illustrated 
below for the first formula: 


$$f(x) = ((x \ll (31 - n_1)) \overset{s}{\gg} 31) \mathbin{\&} (a - b) + ((x \ll (31 - n_2)) \overset{s}{\gg} 31) \mathbin{\&} (c - a) + b$$


Because all variables except x are constants, this can be evaluated in 
eight instructions on the basic RISC. Here again, the additions and 
subtractions can be replaced with exclusive or. 
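To make the method concrete, here is a small C sketch for the values 21, 31, and 20. It assumes the labeling a = 31, b = 20, c = 21 with $n_1$ = 3 and $n_2$ = 0, which falls under Case 4; the constant and function names are illustrative. The mask is built by negating the selected bit rather than by a shift left followed by a signed shift right, since the latter is not portable C.

```c
#include <stdint.h>

/* Assumed labeling for this example: a = 31, b = 20, c = 21. */
enum { VAL_A = 31, VAL_B = 20, VAL_C = 21, POS_N1 = 3, POS_N2 = 0 };

/* All 1's if bit n of x is set, else 0. */
static uint32_t bit_mask(uint32_t x, unsigned n) {
    return 0u - ((x >> n) & 1u);
}

/* Case 4: f(x) = x_n1*(b - a) + x_n2*(a - c) + c, arithmetic modulo 2^32. */
static uint32_t next3(uint32_t x) {
    return (bit_mask(x, POS_N1) & (uint32_t)(VAL_B - VAL_A))
         + (bit_mask(x, POS_N2) & (uint32_t)(VAL_A - VAL_C))
         + (uint32_t)VAL_C;
}
```

Applying next3 three times returns to the starting value: 31 goes to 20, 20 to 21, and 21 back to 31.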


This idea can be extended to cycling among four or more constants. 
The essence of the idea is to find bit positions $n_1, n_2, \ldots$, at which the 
bits uniquely identify the constants. For four constants, three bit 
positions always suffice. Then (for four constants) solve the following 
equation for s, t, u, and v (that is, solve the system of four linear 
equations in which f(x) is a, b, c, or d, and the coefficients $x_{n_i}$ are 0 
or 1): 


$$f(x) = x_{n_1} s + x_{n_2} t + x_{n_3} u + v$$


If the four constants are uniquely identified by only two bit positions, 
the equation to solve is 


$$f(x) = x_{n_1} s + x_{n_2} t + x_{n_1} x_{n_2} u + v.$$


Chapter 3. Power-of-2 Boundaries 
Rounding Up/Down to a Multiple of a Known Power of 2 
Rounding Up/Down to the Next Power of 2 


Detecting a Power-of-2 Boundary Crossing 


3-1 Rounding Up/Down to a Multiple of a Known Power of 2 


Rounding an unsigned integer x down to, for example, the next 
smaller multiple of 8, is trivial: $x \mathbin{\&} -8$ does it. An alternative is 
$(x \overset{u}{\gg} 3) \ll 3$. These work for signed integers as well, provided 
"round down" means to round in the negative direction (e.g., 
$(-37) \mathbin{\&} (-8) = -40$). 


Rounding up is almost as easy. For example, an unsigned integer x can 
be rounded up to the next greater multiple of 8 with either of 


$(x + 7) \mathbin{\&} -8$, or 
$x + (-x \mathbin{\&} 7)$. 


These expressions are correct for signed integers as well, provided 
"round up" means to round in the positive direction. The second term 
of the second expression is useful if you want to know how much you 
must add to x to make it a multiple of 8 [Gold]. 


To round a signed integer to the nearest multiple of 8 toward 0, you 
can combine the two expressions above in an obvious way: 


$$t \leftarrow (x \overset{s}{\gg} 31) \mathbin{\&} 7$$
$$(x + t) \mathbin{\&} -8$$


An alternative for the first line is $t \leftarrow (x \overset{s}{\gg} 2) \overset{u}{\gg} 29$, which is 
useful if the machine lacks and immediate, or if the constant is too 
large for its immediate field. 


Sometimes the rounding factor is given as the $\log_2$ of the alignment 
amount (e.g., a value of 3 means to round to a multiple of 8). In this 
case, code such as the following may be used, where k = 
$\log_2$(alignment amount): 


round down: $x \mathbin{\&} ((-1) \ll k)$ 
            $(x \overset{u}{\gg} k) \ll k$ 
round up:   $t \leftarrow (1 \ll k) - 1$; $(x + t) \mathbin{\&} \neg t$ 
            $t \leftarrow (-1) \ll k$; $(x - t - 1) \mathbin{\&} t$ 
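The rounding expressions of this section can be collected into a few C helpers. This is a sketch with illustrative function names; k is assumed to be in the range 0 to 31, and the round-toward-zero helper is fixed at a multiple of 8 as in the text, with the sign test written as a conditional because a signed right shift of a negative value is implementation-defined in C.

```c
#include <stdint.h>

/* Round x down to a multiple of 2^k: x & ((-1) << k). */
static uint32_t round_down_pow2(uint32_t x, unsigned k) {
    return x & (UINT32_MAX << k);
}

/* Round x up to a multiple of 2^k: t = (1 << k) - 1; (x + t) & ~t. */
static uint32_t round_up_pow2(uint32_t x, unsigned k) {
    uint32_t t = ((uint32_t)1 << k) - 1u;
    return (x + t) & ~t;
}

/* Round a signed x toward 0 to a multiple of 8. */
static int32_t round_toward_zero_8(int32_t x) {
    uint32_t t = (x < 0) ? 7u : 0u;          /* (x >>s 31) & 7 */
    return (int32_t)(((uint32_t)x + t) & ~(uint32_t)7);
}
```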


3-2 Rounding Up/Down to the Next Power of 2 


We define two functions that are similar to floor and ceiling, but 
which are directed roundings to the closest integral power of 2, rather 
than to the closest integer. Mathematically, they are defined by 


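$$\mathrm{flp2}(x) = \begin{cases} \text{undefined}, & x < 0, \\ 0, & x = 0, \\ 2^{\lfloor \log_2 x \rfloor}, & \text{otherwise;} \end{cases} \qquad \mathrm{clp2}(x) = \begin{cases} \text{undefined}, & x < 0, \\ 0, & x = 0, \\ 2^{\lceil \log_2 x \rceil}, & \text{otherwise.} \end{cases}$$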


The initial letters of the function names are intended to suggest "floor" 
and "ceiling." Thus, flp2(x) is the greatest power of 2 that is $\leq x$, and 
clp2(x) is the least power of 2 that is $\geq x$. These definitions make 
sense even when x is not an integer (e.g., flp2(0.1) = 0.0625). The 
functions satisfy several relations analogous to those involving floor 
and ceiling, such as those shown below, where n is an integer. 


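$\lfloor x \rfloor = \lceil x \rceil$ iff x is an integer          $\mathrm{flp2}(x) = \mathrm{clp2}(x)$ iff x is a power of 2 or is 0 
$\lfloor x + n \rfloor = \lfloor x \rfloor + n$                    $\mathrm{flp2}(2^n x) = 2^n\,\mathrm{flp2}(x)$ 
$\lceil x \rceil = -\lfloor -x \rfloor$                             $\mathrm{clp2}(x) = 1/\mathrm{flp2}(1/x)$, $x \neq 0$ 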


Computationally, we deal only with the case in which x is an integer,
and we take it to be unsigned, so the functions are well defined for all
x. We require the value computed to be the arithmetically correct
value modulo 2^32 (that is, we take clp2(x) to be 0 for x > 2^31). The
functions are tabulated below for a few values of x.


x            flp2(x)    clp2(x)
0            0          0
1            1          1
2            2          2
3            2          4
4            4          4
5            4          8
...          ...        ...
2^31         2^31       2^31
2^31 + 1     2^31       0
...          ...        ...
2^32 - 1     2^31       0


Functions flp2 and clp2 are connected by the relations shown below. 
These can be used to compute one from the other, subject to the 
indicated restrictions. 


clp2(x) = 2 flp2(x - 1),       x ≠ 1,
        = flp2(2x - 1),        1 ≤ x ≤ 2^31,

flp2(x) = clp2(x ÷ 2 + 1),     x ≠ 0,
        = clp2(x + 1) ÷ 2,     x < 2^31.


The round-up and round-down functions can be computed quite easily
with the number of leading zeros instruction, as shown below.
However, for these relations to hold for x = 0 and x > 2^31, the
computer must have its shift instructions defined to produce 0 for shift
amounts of -1, 32, and 63. Many machines (e.g., PowerPC) have "mod
64" shifts, which do this. In the case of -1, it is adequate if the
machine shifts in the opposite direction (that is, a shift left of -1
becomes a shift right of 1).


flp2(x) = 1 << (31 - nlz(x))
        = 1 << (nlz(x) ⊕ 31)
        = 0x80000000 >>u nlz(x)

clp2(x) = 1 << (32 - nlz(x - 1))
        = 0x80000000 >>u (nlz(x - 1) - 1)
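In C, shifting a 32-bit value by 32 or more, or by a negative amount, is undefined, so the nlz-based formulas cannot be used verbatim for x = 0 or x > 2^31. The sketch below handles those cases with explicit tests. The helper nlz is a simple portable loop written for this illustration (returning 32 for a zero argument), not a particular machine instruction, and the function names are assumptions.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word; returns 32 for x == 0.
   A plain loop, standing in for a hardware instruction. */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Greatest power of 2 <= x, with flp2(0) = 0. */
uint32_t flp2_nlz(uint32_t x) {
    if (x == 0) return 0;              /* avoid an undefined shift by 32 */
    return 0x80000000u >> nlz(x);
}

/* Least power of 2 >= x, modulo 2^32, with clp2(0) = 0. */
uint32_t clp2_nlz(uint32_t x) {
    if (x <= 1) return x;              /* clp2(0) = 0, clp2(1) = 1 */
    if (x > 0x80000000u) return 0;     /* true result 2^32 wraps to 0 */
    return (uint32_t)1 << (32 - nlz(x - 1));
}
```

On a machine whose shifts behave as described above, the three guards would be unnecessary; they exist here only to keep the C code well defined.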


Rounding Down 


Figure 3-1 illustrates a branch-free algorithm that might be useful if 
number of leading zeros is not available. This algorithm is based on 
right-propagating the leftmost 1-bit, and executes in 12 instructions. 


Figure 3-1 Greatest power of 2 less than or equal to x, branch-free. 


unsigned flp2(unsigned x) {
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return x - (x >> 1);
}


Figure 3-2 shows two simple loops that compute the same function. 
All variables are unsigned integers. The loop on the right keeps 


turning off the rightmost 1-bit of x until x = 0, and then returns the 
previous value of x. 


Figure 3-2 Greatest power of 2 less than or equal to x, simple loops. 


y = 0x80000000;              do {
while (y > x)                   y = x;
   y = y >> 1;                  x = x & (x - 1);
return y;                    } while (x != 0);
                             return y;


The loop on the left executes in 4 nlz(x) + 3 instructions. The loop on
the right, for x ≠ 0, executes in 4 pop(x) instructions,[1] if the
comparison to 0 is zero-cost.


[1] pop(x) is the number of 1-bits in x. 
Rounding Up 


The right-propagation trick yields a good algorithm for rounding up to 
the next power of 2. This algorithm, shown in Figure 3-3, is branch- 
free and runs in 12 instructions. 


Figure 3-3 Least power of 2 greater than or equal to x. 


unsigned clp2(unsigned x) {
   x = x - 1;
   x = x | (x >> 1);
   x = x | (x >> 2);
   x = x | (x >> 4);
   x = x | (x >> 8);
   x = x | (x >>16);
   return x + 1;
}


An attempt to compute this with the obvious loop does not work out 
very well: 


y = 1;
while (y < x)       // Unsigned comparison.
   y = 2*y;
return y;


This code returns 1 for x = 0, which is probably not what you want,
loops forever for x > 2^31, and executes in 4n + 3 instructions, where
n is the power of 2 of the returned integer. Thus, it is slower than the
branch-free code, in terms of instructions executed, for n ≥ 3 (that
is, when the returned value is 8 or more).


3-3 Detecting a Power-of-2 Boundary Crossing 


Assume memory is divided into blocks that are a power of 2 in size,
starting at address 0. The blocks may be words, doublewords, pages,
and so on. Then, given a starting address a and a length l, we wish to
determine whether or not the address range from a to a + l - 1, l ≥ 2,
crosses a block boundary. The quantities a and l are unsigned and any
values that fit in a register are possible.


If l = 0 or 1, a boundary crossing does not occur, regardless of a. If l
exceeds the block size, a boundary crossing does occur, regardless of
a. For very large values of l (wraparound is possible), a boundary
crossing can occur even if the first and last bytes of the address range
are in the same block.


There is a surprisingly concise way to detect boundary crossings on 
the IBM System/370 [CJS]. This method is illustrated below for a 
block size of 4096 bytes (a common page size). 


O     RA,=A(-4096)
ALR   RA,RL
BO    CROSSES


The first instruction forms the logical or of RA (which contains the 
starting address a) and the number 0xFFFFF000. The second
instruction adds in the length, and sets the machine's 2-bit condition 
code. For the add logical instruction, the first bit of the condition code 
is set to 1 if a carry occurred, and the second bit is set to 1 if the 32- 
bit register result is nonzero. The last instruction branches if both bits 
are set. At the branch target, RA will contain the length that extends 
beyond the first page (this is an extra feature that was not asked for). 


If, for example, a = 0 and l = 4096, a carry occurs but the register
result is 0, so the program properly does not branch to label CROSSES. 


Let us see how this method can be adapted to RISC machines, which
generally do not have branch on carry and register result nonzero.
Using a block size of 8 for notational simplicity, the method of [CJS]
branches to CROSSES if a carry occurred ((a | -8) + l ≥ 2^32) and the
register result is nonzero ((a | -8) + l ≠ 2^32). Thus, it is equivalent to
the predicate


(a | -8) + l > 2^32.


This in turn is equivalent to getting a carry in the final addition in
evaluating ((a | -8) - 1) + l. If the machine has branch on carry, this
can be used directly, giving a solution in about five instructions
counting a load of the constant -8.


If the machine does not have branch on carry, we can use the fact that
carry occurs in x + y iff ~x <u y, where <u denotes an unsigned
comparison (see "Unsigned Add/Subtract" on page 29), to obtain the
expression


~((a | -8) - 1) <u l.


Using various identities such as ~(x - 1) = -x gives the following
equivalent expressions for the "boundary crossed" predicate:


-(a | -8) <u l
~(a | -8) + 1 <u l
(~a & 7) + 1 <u l


These can be evaluated in five or six instructions on most RISC 
computers. 


Using another tack, clearly an 8-byte boundary is crossed if


(a & 7) + l - 1 ≥ 8.


This cannot be directly evaluated because of the possibility of
overflow (which occurs if l is very large), but it is easily rearranged to
8 - (a & 7) < l, which can be directly evaluated on the computer (no
part of it overflows). This gives the expression


8 - (a & 7) <u l,


which can be evaluated in five instructions on most RISCs (four if it
has subtract from immediate). If a boundary crossing occurs, the
length that extends beyond the first block is given by l - (8 - (a & 7)),
which can be calculated with one additional instruction (subtract).
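The last predicate generalizes from a block size of 8 to any power of 2. The following is a sketch under that assumption, using 32-bit unsigned quantities; the function and parameter names are illustrative, and block must be a nonzero power of 2.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the range a .. a + len - 1 crosses a block boundary.
   Assumes block is a nonzero power of 2. */
bool crosses_boundary(uint32_t a, uint32_t len, uint32_t block) {
    /* Bytes remaining in the block that contains a (1 .. block);
       this subtraction cannot overflow. */
    uint32_t room = block - (a & (block - 1));
    return room < len;                  /* unsigned comparison */
}

/* Length that extends beyond the first block. Meaningful only when
   crosses_boundary(a, len, block) is true. */
uint32_t bytes_past_first_block(uint32_t a, uint32_t len, uint32_t block) {
    return len - (block - (a & (block - 1)));
}
```

With a = 0 and len = 4096 on 4096-byte pages the function reports no crossing, in agreement with the System/370 example, while len = 4097 crosses with one byte in the next page.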


Chapter 4. Arithmetic Bounds 
Checking Bounds of Integers 
Propagating Bounds through Add's and Subtract's 


Propagating Bounds through Logical Operations 


4-1 Checking Bounds of Integers 


By "bounds checking" we mean to verify that an integer x is within 
two bounds a and b—that is, that 


a ≤ x ≤ b.


We first assume that all quantities are signed integers. 


An important application is the checking of array indexes. For 
example, suppose a one-dimensional array A can be indexed by values 
from 1 to 10. Then, for a reference A(i), a compiler might generate 
code to check that 


1 ≤ i ≤ 10


and to branch or trap if this is not the case. In this section we show 
that this check can be done with a single comparison, by performing 
the equivalent check [PL8]: 


i - 1 ≤u 9   (an unsigned comparison).


This is probably better code, because it involves only one compare- 
branch (or compare-trap), and because the quantity i - 1 is probably 
needed anyway for the array addressing calculations. 


Does the implementation


a ≤ x ≤ b  ⇒  x - a ≤u b - a


always work, even if overflow may occur in the subtractions? It does,
provided we somehow know that a ≤ b. In the case of array bounds
checking, language rules may require that an array not have a number
of elements (or number of elements along any axis) that are 0 or
negative, and this rule can be verified at compile time or, for dynamic
extents, at array allocation time. In such an environment, the
transformation above is correct, as we will now show.


It is convenient to use a lemma, which is good to know in its own 
right. 


Lemma. If a and b are signed integers and a ≤ b, then the computed
value b - a correctly represents the arithmetic value b - a, if the
computed value is interpreted as unsigned.


Proof. (Assume a 32-bit machine.) Because a ≤ b, the true difference
b - a is in the range 0 to (2^31 - 1) - (-2^31) = 2^32 - 1. If the true
difference is in the range 0 to 2^31 - 1, then the machine result is
correct (because the result is representable under signed
interpretation), and the sign bit is off. Hence the machine result is
correct under either signed or unsigned interpretation.


If the true difference is in the range 2^31 to 2^32 - 1, then the machine
result will differ by some multiple of 2^32 (because the result is not
representable under signed interpretation). This brings the result
(under signed interpretation) to the range -2^31 to -1. The machine
result is too low by 2^32, and the sign bit is on. Reinterpreting the
result as unsigned increases it by 2^32, because the sign bit is given a
weight of +2^31 rather than -2^31. Hence the reinterpreted result is
correct.


The "bounds theorem" is 


Theorem. If a and b are signed integers and a ≤ b, then


Equation 1


a ≤ x ≤ b  ⇔  x - a ≤u b - a.


Proof. We distinguish three cases, based on the value of x. In all cases,
by the lemma, since a ≤ b, the computed value b - a is equal to the
arithmetic value b - a if b - a is interpreted as unsigned, as it is in
Equation (1).


Case 1, x < a: In this case, x - a interpreted as unsigned is x - a + 2^32.
Whatever the values of x and b are (within the range of 32-bit
numbers),


x + 2^32 > b.


Therefore


x - a + 2^32 > b - a,


and hence


x - a >u b - a.


In this case, both sides of Equation (1) are false.


Case 2, a ≤ x ≤ b: Then, arithmetically, x - a ≤ b - a. Because a ≤ x,
by the lemma x - a equals the computed value x - a if the latter is
interpreted as unsigned. Hence


x - a ≤u b - a;


that is, both sides of Equation (1) are true.


Case 3, x > b: Then x - a > b - a. Because in this case x > a (because
b ≥ a), by the lemma x - a equals the value of x - a if the latter is
interpreted as unsigned. Hence


x - a >u b - a;


that is, both sides of Equation (1) are false.


The theorem stated above is also true if a and b are unsigned integers. 
This is because for unsigned integers the lemma holds trivially, and 
the above proof is also valid. 
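The theorem can be exercised directly in C, where converting a signed value to an unsigned type is defined to reduce it modulo 2^32. The following is a minimal sketch, assuming the caller guarantees a <= b; the function name in_range is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tests a <= x <= b with a single unsigned comparison.
   Assumes a <= b, as the theorem requires. */
bool in_range(int32_t x, int32_t a, int32_t b) {
    uint32_t ux = (uint32_t)x, ua = (uint32_t)a, ub = (uint32_t)b;
    /* Both differences are taken modulo 2^32 and compared as unsigned. */
    return (uint32_t)(ux - ua) <= (uint32_t)(ub - ua);
}
```

For the array example, in_range(i, 1, 10) performs the check i - 1 ≤u 9; the quantity b - a is a compile-time constant when the bounds are constants.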


Below is a list of similar bounds-checking transformations, with the
one of the theorem above stated again. These all hold for either signed
or unsigned interpretation of a, b, and x.


Equation 2

if a ≤ b then a ≤ x ≤ b  ⇔  x - a ≤u b - a  ⇔  b - x ≤u b - a
if a ≤ b then a ≤ x < b  ⇔  x - a <u b - a
if a ≤ b then a < x ≤ b  ⇔  b - x <u b - a
if a < b then a < x < b  ⇔  x - a - 1 <u b - a - 1  ⇔  b - x - 1 <u b - a - 1


In the last rule, b - a - 1 may be replaced with b + ~a.


There are some quite different transformations that may be useful
when the test is of the form -2^(n-1) ≤ x ≤ 2^(n-1) - 1. This is a test to
see if a signed quantity x can be correctly represented as an n-bit
two's-complement integer. To illustrate with n = 8, the following tests
are equivalent (>>s is a signed shift right, >>u an unsigned one):


a.  -128 ≤ x ≤ 127
b.  x + 128 ≤u 255
c.  (x >>s 7) + 1 ≤u 1
d.  x >>s 7 = x >>s 31
e.  (x >>s 7) + (x >>u 31) = 0
f.  (x << 24) >>s 24 = x
g.  x ⊕ (x >>s 31) ≤ 127

Equation (b) is simply an application of the preceding material in this 
section. Equation (c) is as well, after shifting x right seven positions. 
Equations (c)-(f) and possibly (g) are probably useful only if the 
constants in Equations (a) and (b) exceed the size of the immediate 
fields of the computer's compare and add instructions. 


Another special case involving powers of 2 is


0 ≤ x ≤ 2^n - 1  ⇔  (x >>u n) = 0,


or, more generally,


a ≤ x ≤ a + 2^n - 1  ⇔  ((x - a) >>u n) = 0.


4-2 Propagating Bounds through Add's and Subtract's 


Some optimizing compilers perform "range analysis" of expressions. 
This is the process of determining, for each occurrence of an 
expression in a program, upper and lower bounds on its value. 
Although this optimization is not a really big winner, it does permit 
improvements such as omitting the range check on a C "switch" 
statement and omitting some subscript bounds checks that compilers 
may provide as a debugging aid. 


Suppose we have bounds on two variables x and y as follows, where
all quantities are unsigned:


Equation 3

a ≤ x ≤ b, and
c ≤ y ≤ d.


Then, how can we compute tight bounds on x + y, x - y, and -x?
Arithmetically, of course, a + c ≤ x + y ≤ b + d; but the point is
that the additions may overflow.


The way to calculate the bounds is expressed in the following: 


Theorem. If a, b, c, d, x, and y are unsigned integers and

a ≤ x ≤ b and
c ≤ y ≤ d,

then


Equation 4

0 ≤ x + y ≤ 2^32 - 1       if a + c ≤ 2^32 - 1 and b + d ≥ 2^32,
a + c ≤ x + y ≤ b + d      otherwise;


Equation 5

0 ≤ x - y ≤ 2^32 - 1       if a - d < 0 and b - c ≥ 0,
a - d ≤ x - y ≤ b - c      otherwise;


Equation 6

0 ≤ -x ≤ 2^32 - 1          if a = 0 and b ≠ 0,
-b ≤ -x ≤ -a               otherwise.


Inequalities (4) say that the bounds on x + y are "normally" a + c 
and b + d, but if the calculation of a + c does not overflow and the 
calculation of b + d does overflow, then the bounds are 0 and the 
maximum unsigned integer. Equations (5) are interpreted similarly, 
but the true result of a subtraction being less than 0 constitutes an 
overflow (in the negative direction). 


Proof. If neither a + c nor b + d overflows, then x + y, with x and y
in the indicated ranges, cannot overflow, making the computed results
equal to the true results, so the second inequality of (4) holds. If both
a + c and b + d overflow, then so also does x + y. Now
arithmetically, it is clear that


a + c - 2^32 ≤ x + y - 2^32 ≤ b + d - 2^32.


This, however, is what is calculated when the three terms overflow.
Hence in this case also,


a + c ≤ x + y ≤ b + d.


If a + c does not overflow but b + d does, then


a + c ≤ 2^32 - 1 and b + d ≥ 2^32.


Because x + y takes on all values in the range a + c to b + d, it takes
on the values 2^32 - 1 and 2^32; that is, the computed value x + y takes
on the values 2^32 - 1 and 0 (although it doesn't take on all values in
that range).


Lastly, the case that a + c overflows but b + d does not cannot occur,
because a ≤ b and c ≤ d.

This completes the proof of inequalities (4). The proof of (5) is similar,
but "overflow" means that a true difference is less than 0.


Inequalities (6) can be proved by using (5) with a = b = 0, and then
renaming the variables. (The expression -x with x an unsigned number
means to compute the value of 2^32 - x, or of ~x + 1 if you prefer.)


Because unsigned overflow is so easy to recognize (see "Unsigned 
Add/Subtract" on page 29), these results are easily embodied in code, 
as shown in Figure 4-1 for addition and subtraction. The computed 
lower and upper limits are variables s and t, respectively. 


Figure 4-1 Propagating unsigned bounds through addition and subtraction 
operations. 


s = a + c;                        s = a - d;
t = b + d;                        t = b - c;
if (s >= a && t < b) {            if (s > a && t <= b) {
   s = 0;                            s = 0;
   t = 0xFFFFFFFF;}                  t = 0xFFFFFFFF;}
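The two fragments of Figure 4-1 can be packaged as functions for use in a range analyzer. The following is a sketch under the theorem's assumptions (32-bit unsigned quantities with a <= b and c <= d); the struct urange and the function names are hypothetical helpers introduced for this illustration.

```c
#include <stdint.h>

/* Hypothetical pair holding the computed lower and upper limits. */
typedef struct { uint32_t lo, hi; } urange;

/* Bounds on x + y given a <= x <= b and c <= y <= d (all unsigned). */
urange add_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r;
    r.lo = a + c;
    r.hi = b + d;
    /* a + c did not overflow but b + d did: the range wraps around. */
    if (r.lo >= a && r.hi < b) {
        r.lo = 0;
        r.hi = 0xFFFFFFFFu;
    }
    return r;
}

/* Bounds on x - y under the same assumptions. */
urange sub_bounds(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    urange r;
    r.lo = a - d;
    r.hi = b - c;
    /* a - d went below 0 but b - c did not: the range wraps around. */
    if (r.lo > a && r.hi <= b) {
        r.lo = 0;
        r.hi = 0xFFFFFFFFu;
    }
    return r;
}
```

When both limits overflow, the wrapped limits are returned unchanged, which corresponds to the "otherwise" rows of Equations (4) and (5).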


Signed Numbers 


The case of signed numbers is not so clean. As before, suppose we 
have bounds on two variables x and y as follows, where all quantities 
are signed: 


Equation 7


a ≤ x ≤ b, and
c ≤ y ≤ d.


We wish to compute tight bounds on x + y, x - y, and -x. The 
reasoning is very similar to that for the case of unsigned numbers, and 
the results for addition are shown below. 


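Equation 8


a + c < -2^31, b + d < -2^31:           a + c ≤ x + y ≤ b + d
a + c < -2^31, b + d ≥ -2^31:           -2^31 ≤ x + y ≤ 2^31 - 1
-2^31 ≤ a + c < 2^31, b + d < 2^31:     a + c ≤ x + y ≤ b + d
-2^31 ≤ a + c < 2^31, b + d ≥ 2^31:     -2^31 ≤ x + y ≤ 2^31 - 1
a + c ≥ 2^31, b + d ≥ 2^31:             a + c ≤ x + y ≤ b + d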


The first row means that if both of the additions a + c and b + d
overflow in the negative direction, then the computed sum x + y lies
between the computed sums a + c and b + d. This is because all
three computed sums are too high by the same amount (2^32). The
second row means that if the addition a + c overflows in the negative 
direction, and the addition b + d either does not overflow or 
overflows in the positive direction, then the computed sum x + y can 
take on the extreme negative number and the extreme positive 
number (although perhaps not all values in between), which is not 
difficult to show. The other rows are interpreted similarly. 


The rules for propagating bounds on signed numbers through the
subtraction operation can easily be derived by rewriting the bounds
on y as


-d ≤ -y ≤ -c


and using the rules for addition. The results are shown below.


a - d < -2^31, b - c < -2^31:           a - d ≤ x - y ≤ b - c
a - d < -2^31, b - c ≥ -2^31:           -2^31 ≤ x - y ≤ 2^31 - 1
-2^31 ≤ a - d < 2^31, b - c < 2^31:     a - d ≤ x - y ≤ b - c
-2^31 ≤ a - d < 2^31, b - c ≥ 2^31:     -2^31 ≤ x - y ≤ 2^31 - 1
a - d ≥ 2^31, b - c ≥ 2^31:             a - d ≤ x - y ≤ b - c


The rules for negation can be derived from the rules for subtraction by 
taking a = b = 0, omitting some impossible combinations, 
simplifying, and renaming. The results are as follows: 


a = -2^31, b = -2^31:     -x = -2^31
a = -2^31, b ≠ -2^31:     -2^31 ≤ -x ≤ 2^31 - 1
a ≠ -2^31:                -b ≤ -x ≤ -a


C code for the case of signed numbers is a bit messy. We will consider
only addition. It seems to be simplest to check for the two cases in (8)
in which the computed limits are the extreme negative and positive
numbers. Overflow in the negative direction occurs if the two
operands are negative and the sum is nonnegative (see "Signed
Add/Subtract" on page 26). Thus, to check for the condition that
a + c < -2^31, we could let s = a + c; and then code something like
"if (a < 0 && c < 0 && s >= 0) ...." It will be more efficient,[1]
however, to perform logical operations directly on the arithmetic
variables, with the sign bit containing the true/false result of the
logical operations. Then, we write the above condition as
"if ((a & c & ~s) < 0) ...." These considerations lead to the program
fragment shown in Figure 4-2 below.


[1] In the sense of more compact, less branchy, code; faster-running 
code may result from checking first for the case of no overflow, 
assuming the limits are not likely to be large. 


Figure 4-2 Propagating signed bounds through an addition operation. 


s = a + c;
t = b + d;
u = a & c & ~s & ~(b & d & ~t);
v = ((a ^ c) | ~(a ^ s)) & (~b & ~d & t);
if ((u | v) < 0) {
   s = 0x80000000;
   t = 0x7FFFFFFF;}


Here u is true (sign bit is 1) if the addition a + c overflows in the
negative direction, and the addition b + d does not overflow in the
negative direction. Variable v is true if the addition a + c does not
overflow and the addition b + d overflows in the positive direction.
The former condition can be expressed as "a and c have different
signs, or a and s have the same sign." The "if" test is equivalent to
"if (u < 0 || v < 0)", that is, if either u or v is true.


4-3 Propagating Bounds through Logical Operations 


As in the preceding section, suppose we have bounds on two variables 
x and y as follows, where all quantities are unsigned: 


Equation 9

a ≤ x ≤ b, and
c ≤ y ≤ d.


Then what are some reasonably tight bounds on x | y, x & y, x ⊕ y,
and ~x?


Combining inequalities (9) with some inequalities from Section 2-3 on
page 16, and noting that ~x = 2^32 - 1 - x, yields


max(a, c) ≤ (x | y) ≤ b + d,
0 ≤ (x & y) ≤ min(b, d),
0 ≤ (x ⊕ y) ≤ b + d, and
~b ≤ ~x ≤ ~a,


where it is assumed that the addition b + d does not overflow. These 
are easy to compute and might be good enough for the compiler 
application mentioned in the preceding section. The bounds in the 
first two inequalities, however, are not tight. For example, writing 
constants in binary, suppose 


Equation 10


00010 ≤ x ≤ 00100, and
01001 ≤ y ≤ 10100.


Then, by inspection (e.g., trying all 36 possibilities for x and y), we
see that 01010 ≤ (x | y) ≤ 10111. Thus, the lower bound is not
max(a, c) nor is it a | c, and the upper bound is not b + d, nor is it
b | d.


Given the values of a, b, c, and d in inequalities (9), how can one 
obtain tight bounds on the logical expressions? Consider first the 
minimum value attained by x | y. A reasonable guess might be the 
value of this expression with x and y both at their minima—that is, a | 
c. Example (10), however, shows that the minimum can be lower than 
this. 


To find the minimum, our procedure is to start with x = a and y = c, 
and then find an amount by which to increase either x or y so as to 
reduce the value of x | y. The result will be this reduced value. Rather 
than assigning a and c to x and y, however, we work directly with a 
and c, increasing one of them when doing so is valid and it reduces 
the value of a | c. 


The procedure is to scan the bits of a and c from left to right. If both 
bits are 0, the result will have a 0 in that position. If both bits are 1, 
the result will have a 1 in that position (clearly, no values of x and y 
could make the result less). In these cases, continue the scan to the 
next bit position. If one scanned bit is 1 and the other is 0, then it is 
possible that changing the 0 to 1 and setting all the following bits in 
that bound's value to 0 will reduce the value of a | c. This change will 
not increase the value of a | c, because the result has a 1 in that 
position anyway, from the other bound. Therefore, form the number 
with the 0 changed to 1 and subsequent bits changed to 0. If that is 
less than or equal to the corresponding upper limit, the change can be 
made; do it, and the result is the or of the modified value with the 
other lower bound. If the change cannot be made (because the altered 
value exceeds the corresponding upper bound), continue the scan to 
the next bit position. 


That's all there is to it. It might seem that after making the change, the 
scan should continue, looking for other opportunities to further reduce 
the value of a | c. However, even if a position is found that allows a 0 
to be changed to 1, setting the subsequent bits to 0 does not reduce 
the value of a | c, because those bits are already 0. 


C code for this algorithm is shown in Figure 4-3. We assume that the 
compiler will move the subexpressions ~a & c and a & ~c out of
the loop. More significantly, if the number of leading zeros instruction 
is available, the program can be speeded up by initializing m with 


m = 0x80000000 >> nlz(a ^ c); 
Figure 4-3 Minimum value of x | y with bounds on x and y. 


unsigned minOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
      }
      else if (a & ~c & m) {
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a | c;
}


This skips over initial bit positions in which a and c are both 0 or
both 1. For this speedup to be effective when a ^ c is 0 (that is,
when a = c), the machine's shift right instruction should be mod 64.
If number of leading zeros is not available, it may be worthwhile to
use some version of the flp2 function (see page 46) with argument
a ^ c.


Now let us consider the maximum value attained by x | y, with the
variables bounded as shown in inequalities (9). The algorithm is
similar to that for the minimum, except it scans the values of bounds b
and d (from left to right), looking for a position in which both bits are
1. If such a position is found, the algorithm tries to increase the value
of b | d by decreasing one of the bounds by changing the 1 to 0, and
setting all subsequent bits in that bound to 1. If this is acceptable (if
the resulting value is greater than or equal to the corresponding lower
bound), the change is made and the result is the value of b | d using
the modified bound. If the change cannot be done, it is attempted on
the other bound. If the change cannot be done to either bound, the
scan continues. C code for this algorithm is shown in Figure 4-4. Here
the subexpression b & d can be moved out of the loop, and the
algorithm can be speeded up by initializing m with


m = 0x80000000 >> nlz(b & d);
Figure 4-4 Maximum value of x | y with bounds on x and y. 


unsigned maxOR(unsigned a, unsigned b,
               unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & d & m) {
         temp = (b - m) | (m - 1);
         if (temp >= a) {b = temp; break;}
         temp = (d - m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b | d;
}
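One way to gain confidence in the two scanning procedures is to compare them with an exhaustive search over small ranges, such as those of example (10). The sketch below restates minOR and maxOR with fixed-width types and adds two brute-force helpers; the helpers and their names are written for this illustration and are practical only for small ranges with a <= b and c <= d.

```c
#include <stdint.h>

uint32_t minOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {                 /* a has 0, c has 1 here */
            temp = (a | m) & (0u - m);    /* set bit, clear lower bits */
            if (temp <= b) { a = temp; break; }
        } else if (a & ~c & m) {          /* a has 1, c has 0 here */
            temp = (c | m) & (0u - m);
            if (temp <= d) { c = temp; break; }
        }
        m >>= 1;
    }
    return a | c;
}

uint32_t maxOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (b & d & m) {                  /* both upper bounds have 1 */
            temp = (b - m) | (m - 1);     /* clear bit, set lower bits */
            if (temp >= a) { b = temp; break; }
            temp = (d - m) | (m - 1);
            if (temp >= c) { d = temp; break; }
        }
        m >>= 1;
    }
    return b | d;
}

/* Exhaustive minimum of x | y over a <= x <= b, c <= y <= d. */
uint32_t brute_min_or(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t best = 0xFFFFFFFFu;
    for (uint32_t x = a; ; x++) {
        for (uint32_t y = c; ; y++) {
            if ((x | y) < best) best = x | y;
            if (y == d) break;
        }
        if (x == b) break;
    }
    return best;
}

/* Exhaustive maximum of x | y over the same ranges. */
uint32_t brute_max_or(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t best = 0;
    for (uint32_t x = a; ; x++) {
        for (uint32_t y = c; ; y++) {
            if ((x | y) > best) best = x | y;
            if (y == d) break;
        }
        if (x == b) break;
    }
    return best;
}
```

For the bounds of example (10), a = 2, b = 4, c = 9, d = 20, both approaches give a minimum of 10 (binary 01010) and a maximum of 23 (binary 10111).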


There are two ways in which we might propagate the bounds of 
inequalities (9) through the expression x & y: algebraic and direct 
computation. The algebraic method uses DeMorgan's rule: 


x & y = ~(~x | ~y).


Because we know how to propagate bounds precisely through or, and
it is trivial to propagate them through not
(a ≤ x ≤ b ⇔ ~b ≤ ~x ≤ ~a), we have


minAND(a, b, c, d) = ~maxOR(~b, ~a, ~d, ~c), and
maxAND(a, b, c, d) = ~minOR(~b, ~a, ~d, ~c).


For the direct computation method, the code is very similar to that for 
propagating bounds through or. It is shown in Figures 4-5 and 4-6. 


Figure 4-5 Minimum value of x & y with bounds on x and y. 


unsigned minAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (~a & ~c & m) {
         temp = (a | m) & -m;
         if (temp <= b) {a = temp; break;}
         temp = (c | m) & -m;
         if (temp <= d) {c = temp; break;}
      }
      m = m >> 1;
   }
   return a & c;
}


Figure 4-6 Maximum value of x & y with bounds on x and y.


unsigned maxAND(unsigned a, unsigned b,
                unsigned c, unsigned d) {
   unsigned m, temp;

   m = 0x80000000;
   while (m != 0) {
      if (b & ~d & m) {
         temp = (b & ~m) | (m - 1);
         if (temp >= a) {b = temp; break;}
      }
      else if (~b & d & m) {
         temp = (d & ~m) | (m - 1);
         if (temp >= c) {d = temp; break;}
      }
      m = m >> 1;
   }
   return b & d;
}


The algebraic method of finding bounds on expressions in terms of the 
functions for and, or, and not works for all the binary logical 
expressions except exclusive or and equivalence. The reason these two 
present a difficulty is that when expressed in terms of and, or, and 
not, there are two terms containing x and y. For example, we are to 
find 


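$$
\min_{\substack{a \le x \le b \\ c \le y \le d}} (x \oplus y)
= \min_{\substack{a \le x \le b \\ c \le y \le d}}
\bigl((x \mathbin{\&} \lnot y) \mid (\lnot x \mathbin{\&} y)\bigr).
$$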


The two operands of the or cannot be separately minimized (without 
proof that it works, which actually it does) because we seek one value 
of x and one value of y that minimizes the whole or expression. 


The following expressions may be used to propagate bounds through 
exclusive or: 


minXOR(a, b, c, d) = minAND(a, b, ~d, ~c) | minAND(~b, ~a, c, d),
maxXOR(a, b, c, d) = maxOR(0, maxAND(a, b, ~d, ~c),
                           0, maxAND(~b, ~a, c, d)).


It is straightforward to evaluate the minXOR and maxXOR functions 
by direct computation. The code for minXOR is the same as that for 
minOR (Figure 4-3) except with the two break statements removed, 
and the return value changed to a ^ c. The code for maxXOR is the 
same as that for maxOR (Figure 4-4) except with the four lines under 
the if clause replaced with 


temp = (b - m) | (m - 1);
if (temp >= a) b = temp;
else {
   temp = (d - m) | (m - 1);
   if (temp >= c) d = temp;
}


and the return value changed to b ^ d. 
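Applying the modifications just described gives complete functions. The following is a sketch assembled from that description, assuming 32-bit unsigned bounds with a <= b and c <= d; it uses fixed-width types, and writes -m as 0u - m to keep the arithmetic visibly unsigned.

```c
#include <stdint.h>

/* minOR without the two break statements, returning a ^ c. */
uint32_t minXOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (~a & c & m) {
            temp = (a | m) & (0u - m);
            if (temp <= b) a = temp;
        } else if (a & ~c & m) {
            temp = (c | m) & (0u - m);
            if (temp <= d) c = temp;
        }
        m >>= 1;
    }
    return a ^ c;
}

/* maxOR with the body of the if clause replaced, returning b ^ d. */
uint32_t maxXOR(uint32_t a, uint32_t b, uint32_t c, uint32_t d) {
    uint32_t m = 0x80000000u, temp;
    while (m != 0) {
        if (b & d & m) {
            temp = (b - m) | (m - 1);
            if (temp >= a) b = temp;
            else {
                temp = (d - m) | (m - 1);
                if (temp >= c) d = temp;
            }
        }
        m >>= 1;
    }
    return b ^ d;
}
```

With the bounds of example (10), a = 2, b = 4, c = 9, d = 20, these return 8 and 23, which agree with an exhaustive search over the 36 combinations.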
Signed Bounds 


If the bounds are signed integers, propagating them through logical 
expressions is substantially more complicated. The calculation is 
irregular if 0 is within the range a to b, or c to d. One way to calculate 
the lower and upper bounds for the expression x | y is shown in Table 
4-1. A "+" entry means that the bound at the top of the column is
greater than or equal to 0, and a "-" entry means that it is less than 0. 
The column labelled "minOR (signed)" contains expressions for 
computing the lower bound of x | y and the last column contains 
expressions for computing the upper bound of x | y. One way to 
program this is to construct a value ranging from 0 to 15 from the sign 
bits of a, b, c, and d, and use a "switch" statement. Notice that not all 
values from 0 to 15 are used, because it is impossible to have a > b or 
c>d. 


For signed numbers, the relation 


a ≤ x ≤ b  ⇔  ~b ≤ ~x ≤ ~a


holds, so the algebraic method can be used to extend the results of 
Table 4-1 to other logical expressions (except for exclusive or and 
equivalence). We leave this and similar extensions to others. 


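Table 4-1. Signed minOR and maxOR from Unsigned

a  b  c  d    minOR (signed)                maxOR (signed)
-  -  -  -    minOR(a, b, c, d)             maxOR(a, b, c, d)
-  -  -  +    a                             -1
-  -  +  +    minOR(a, b, c, d)             maxOR(a, b, c, d)
-  +  -  -    c                             -1
-  +  -  +    min(a, c)                     maxOR(0, b, 0, d)
-  +  +  +    minOR(a, 0xFFFFFFFF, c, d)    maxOR(0, b, c, d)
+  +  -  -    minOR(a, b, c, d)             maxOR(a, b, c, d)
+  +  -  +    minOR(a, b, c, 0xFFFFFFFF)    maxOR(a, b, 0, d)
+  +  +  +    minOR(a, b, c, d)             maxOR(a, b, c, d)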


Chapter 5. Counting Bits 


Counting 1-Bits 
Parity 
Counting Leading 0's 


Counting Trailing 0's 


5-1 Counting 1-Bits 


The IBM Stretch computer (ca. 1960) had a means of counting the 
number of 1-bits in a word as well as the number of leading 0's. It 
produced these two quantities as a by-product of all logical 
operations! The former function is sometimes called population count 
(e.g., on Stretch and the SPARCV9). 


For machines that don't have this instruction, a good way to count the 
number of 1-bits is to first set each 2-bit field equal to the sum of the 
two single bits that were originally in the field, and then sum adjacent 
2-bit fields, putting the results in each 4-bit field, and so on. A more 
complete discussion of this trick is in [RND]. The method is illustrated 
in Figure 5-1, in which the first row shows a computer word whose 1- 
bits are to be summed, and the last row shows the result (23 decimal). 


Figure 5-1. Counting 1-bits, "divide and conquer" strategy. 


1 0 1 1 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1

01 10 10 00 01 01 00 10 01 10 10 01 10 10 10 10

0011  0010  0010  0010  0011  0011  0100  0100

00000101  00000100  00000110  00001000

0000000000001001  0000000000001110

00000000000000000000000000010111


This is an example of the "divide and conquer" strategy, in which the 
original problem (summing 32 bits) is divided into two problems 
(summing 16 bits), which are solved separately, and the results are 
combined (added, in this case). The strategy is applied recursively, 
breaking the 16-bit fields into 8-bit fields, and so on. 


In the case at hand, the ultimate small problems (summing adjacent 
bits) can all be done in parallel, and combining adjacent sums can also 
be done in parallel in a fixed number of steps at each stage. The result 
is an algorithm that can be executed in log2(32) = 5 steps.


Other examples of divide and conquer are the well-known technique 
of binary search, a sorting method known as quicksort, and a method 
for reversing the bits of a word, discussed on page 101. 


The method illustrated in Figure 5-1 may be committed to C code as 


x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
x = (x & 0x0000FFFF) + ((x >>16) & 0x0000FFFF);


The first line uses (x >> 1) & 0x55555555 rather than the
perhaps more natural (x & 0xAAAAAAAA) >> 1 because the code
shown avoids generating two large constants in a register. This would 
cost an instruction if the machine lacks the and not instruction. A 
similar remark applies to the other lines. 


Clearly, the last and is unnecessary, and other and's may be omitted 
when there is no danger that a field's sum will carry over into the 
adjacent field. Furthermore, there is a way to code the first line that 
uses one fewer instruction. This leads to the simplification shown in 
Figure 5-2, which executes in 21 instructions and is branch-free. 


Figure 5-2 Counting 1-bits in a word. 


int pop(unsigned x) {
   x = x - ((x >> 1) & 0x55555555);
   x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
   x = (x + (x >> 4)) & 0x0F0F0F0F;
   x = x + (x >> 8);
   x = x + (x >> 16);
   return x & 0x0000003F;
}


The first assignment to x is based on the first two terms of the rather 
surprising formula 


Equation 1

$$
\mathrm{pop}(x) = x - \left\lfloor \frac{x}{2} \right\rfloor
- \left\lfloor \frac{x}{4} \right\rfloor - \cdots
- \left\lfloor \frac{x}{2^{31}} \right\rfloor .
$$


In Equation (1), we must have x ≥ 0. By treating x as an unsigned
integer, Equation (1) can be implemented with a sequence of 31 shift
right immediate's of 1, and 31 subtract's. The procedure of Figure 5-2
uses the first two terms of this on each 2-bit field, in parallel.
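Equation (1) can be checked with a short loop that repeatedly shifts right by 1 and subtracts. The following is an illustrative sketch, not the recommended way to count bits; the loop stops early once the shifted value reaches 0, since the remaining terms are then all 0. The function names are assumptions, and pair_counts shows the two-term form applied to every 2-bit field at once.

```c
#include <stdint.h>

/* pop(x) = x - floor(x/2) - floor(x/4) - ... - floor(x/2^31). */
unsigned pop_eq1(uint32_t x) {
    uint32_t sum = x;
    while (x != 0) {
        x >>= 1;        /* next term: floor of x divided by 2^i */
        sum -= x;
    }
    return (unsigned)sum;
}

/* First two terms of Equation (1) applied to each 2-bit field in
   parallel: every field ends up holding the count of its own 1-bits. */
uint32_t pair_counts(uint32_t x) {
    return x - ((x >> 1) & 0x55555555u);
}
```

For instance, pair_counts(0xB) turns the fields 10 and 11 into 01 and 10, giving 0x6.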


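There is a simple proof of Equation (1), which is shown below for the 
case of a 4-bit word. Let the word be b3b2b1b0, where each bi = 0 or 
1. Then, 

\[ \begin{aligned}
x - \left\lfloor \frac{x}{2} \right\rfloor - \left\lfloor \frac{x}{4} \right\rfloor - \left\lfloor \frac{x}{8} \right\rfloor
 &= b_3 \cdot 2^3 + b_2 \cdot 2^2 + b_1 \cdot 2^1 + b_0 \cdot 2^0 \\
 &\quad - (b_3 \cdot 2^2 + b_2 \cdot 2^1 + b_1 \cdot 2^0) \\
 &\quad - (b_3 \cdot 2^1 + b_2 \cdot 2^0) \\
 &\quad - (b_3 \cdot 2^0) \\
 &= b_3 (2^3 - 2^2 - 2^1 - 2^0) + b_2 (2^2 - 2^1 - 2^0) + b_1 (2^1 - 2^0) + b_0 (2^0) \\
 &= b_3 + b_2 + b_1 + b_0.
\end{aligned} \]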


Alternatively, Equation (1) can be derived by noting that bit i of the 
binary representation of a nonnegative integer x is given by 

\[ b_i = \left\lfloor \frac{x}{2^i} \right\rfloor - 2 \left\lfloor \frac{x}{2^{i+1}} \right\rfloor \]

and summing this for i = 0 to 31. Work it out; the last term is 0 
because x < 2^32. 


Equation (1) generalizes to other bases. For base ten it is 

\[ \mathrm{sum\_digits}(x) = x - 9 \left\lfloor \frac{x}{10} \right\rfloor - 9 \left\lfloor \frac{x}{100} \right\rfloor - \cdots, \]


where the terms are carried out until they are 0. This can be proved 
by essentially the same technique used above. 


A variation of the above algorithm is to use a base four analogue of 
Equation (1) as a substitute for the second executable line of Figure 
5-2: 


x = x - 3*((x >> 2) & 0x33333333) 


This code, however, uses the same number of instructions as the line it 
replaces (six), and requires a fast multiply-by-3 instruction. 


An algorithm in HAKMEM memo [HAK, item 169] counts the number 
of 1-bits in a word by using the first three terms of (1) to produce a 
word of 3-bit fields, each of which contains the number of 1-bits that 
were in it. It then adds adjacent 3-bit fields to form 6-bit field sums, 
and then adds the 6-bit fields by computing the value of the word 
modulo 63. Expressed in C, the algorithm is (the long constants are in 
octal) 


int pop(unsigned x) { 
   unsigned n; 

   n = (x >> 1) & 033333333333;       // Count bits in 
   x = x - n;                         // each 3-bit 
   n = (n >> 1) & 033333333333;       // field. 
   x = x - n; 
   x = (x + (x >> 3)) & 030707070707; // 6-bit sums. 
   x = modu(x, 63);                   // Add 6-bit sums. 
   return x; 
} 


The last line uses the unsigned modulus function. (It could be either 
signed or unsigned if the word length were a multiple of 3.) That the 
modulus function sums the 6-bit fields becomes clear by regarding the 
word x as an integer written in base 64. The remainder upon dividing 
a base b integer by b - 1 is, for b ≥ 3, congruent mod b - 1 to the sum 
of the digits and, of course, is less than b - 1. Because the sum of the 
digits in this case must be less than or equal to 32, modu(x, 63) must 
be equal to the sum of the digits of x, which is to say equal to the 
number of 1-bits in the original x. 


This algorithm requires only ten instructions on the DEC PDP-10, 
because that machine has an instruction for computing the remainder 
with its second operand directly referencing a fullword in memory. On 
a basic RISC, it requires about 13 instructions, assuming the machine 
has unsigned modulus as one instruction (but not directly referencing 
a fullword immediate or memory operand). But it is probably not very 
fast, because division is almost always a slow operation. Also, it 
doesn't apply to 64-bit word lengths by simply extending the 
constants, although it does work for word lengths up to 62. 


A variation on the HAKMEM algorithm is to use Equation (1) to count 
the number of 1's in each 4-bit field, working on all eight 4-bit fields 
in parallel [Hay1]. Then, the 4-bit sums may be converted to 8-bit 
sums in a straightforward way, and the four bytes can be added with a 
multiplication by 0x01010101. This gives 


int pop(unsigned x) { 
   unsigned n; 

   n = (x >> 1) & 0x77777777;        // Count bits in 
   x = x - n;                        // each 4-bit 
   n = (n >> 1) & 0x77777777;        // field. 
   x = x - n; 
   n = (n >> 1) & 0x77777777; 
   x = x - n; 
   x = (x + (x >> 4)) & 0x0F0F0F0F;  // Get byte sums. 
   x = x*0x01010101;                 // Add the bytes. 
   return x >> 24; 
} 


This is 19 instructions on the basic RISC. It works well if the machine 
is two-address, because the first six lines can be done with only one 
move register instruction. Also, the repeated use of the mask 
0x77777777 permits loading it into a register and referencing it with 
register-to-register instructions. Furthermore, most of the shifts are of 
only one position. 


A quite different bit-counting method, illustrated in Figure 5-3, is to 
turn off the rightmost 1-bit repeatedly [Weg, RND], until the result is 
0. It is very fast if the number of 1-bits is small, taking 2 + 5pop(x) 
instructions. 


Figure 5-3 Counting 1-bits in a sparsely populated word. 


int pop(unsigned x) { 
   int n; 

   n = 0; 
   while (x != 0) { 
      n = n + 1; 
      x = x & (x - 1);     // Turn off the rightmost 1-bit. 
   } 
   return n; 
} 


This has a dual algorithm that is applicable if the number of 1-bits is 


expected to be large. The dual algorithm keeps turning on the 
rightmost 0-bit with x = x | (x + 1), until the result is all 1's 
(-1). Then, it returns 32 - n. (Alternatively, the original number x can 
be complemented, or n can be initialized to 32 and counted down). 
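
The dual algorithm can be sketched in C along the following lines. This is only an illustration of the description above, assuming a 32-bit word; the name pop_dense is hypothetical.

```c
#include <stdint.h>

/* Sketch of the dual of Figure 5-3: count how many 0-bits must be turned on
   before the word becomes all 1's, then return 32 minus that count.
   pop_dense is a hypothetical name for this illustration. */
int pop_dense(uint32_t x) {
    int n = 0;

    while (x != 0xFFFFFFFFu) {
        n = n + 1;
        x = x | (x + 1);   /* turn on the rightmost 0-bit */
    }
    return 32 - n;
}
```

The loop runs once per 0-bit, so it is quick when x is nearly all 1's and takes 32 iterations when x is 0.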


A rather amazing algorithm is to rotate x left one position, 31 times, 
adding the 32 terms [MM]. The sum is the negative of pop(x)! That is, 


Equation 2 

\[ \mathrm{pop}(x) = -\sum_{i=0}^{31} \left( x \overset{\mathrm{rot}}{\ll} i \right), \]


where the additions are done modulo the word size, and the final sum 
is interpreted as a two's-complement integer. This is just a novelty; it 
would not be useful on most machines because the loop is executed 31 
times and thus it requires 63 instructions plus the loop-control 
overhead. 


To see why Equation (2) works, consider what happens to a single 1- 
bit of x. It gets rotated to all positions, and when these 32 numbers 
are added, a word of all 1-bits results. But this is -1. To illustrate, 
consider a 6-bit word size and x = 001001 (binary): 


001001    x 
010010    x rotated left 1 
100100    x rotated left 2 
001001    x rotated left 3 
010010    x rotated left 4 
100100    x rotated left 5 
------ 
111110    sum (modulo 64) = -2 


Of course, rotate-right would work just as well. 


The method of Equation (1) is very similar to this "rotate and sum" 
method, which becomes clear by rewriting (1) as 

\[ \mathrm{pop}(x) = x - \sum_{i=1}^{31} (x \gg i), \]

where the shifts are unsigned. 


This gives a slightly better algorithm than Equation (2) provides. It is 
better because it uses shift right, which is more commonly available 
than rotate, and because the loop can be terminated when the shifted 
quantity becomes 0. This reduces the loop-control code and may save 
a few iterations. The two algorithms are contrasted in Figure 5-4. 


Figure 5-4 Two similar bit-counting algorithms. 


int pop(unsigned x) { 
   int i, sum; 

   // Rotate and sum method        // Shift right & subtract 
   sum = x;                        // sum = x; 
   for (i = 1; i <= 31; i++) {     // while (x != 0) { 
      x = rotatel(x, 1);           //    x = x >> 1; 
      sum = sum + x;               //    sum = sum - x; 
   }                               // } 
   return -sum;                    // return sum; 
} 


A less interesting algorithm that may be competitive with all the 
algorithms for pop(x) in this section is to have a table that contains 
pop(x) for, say, x in the range 0 to 255. The table can be accessed four 
times, adding the four numbers obtained. A branch-free version of the 
algorithm looks like this: 


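int pop(unsigned x) {              // Table lookup. 
   static char table[256] = { 
      0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
      ... 
      4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8}; 

   return table[x & 0xFF] + 
          table[(x >> 8) & 0xFF] + 
          table[(x >> 16) & 0xFF] + 
          table[(x >> 24)]; 
} 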
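
Because the 256 table entries are tedious to type, one option is to fill the table once at startup. The sketch below is an assumption-laden illustration, not the book's listing: the names pop_table, init_pop_table, and pop_lookup are hypothetical, and the table is built with the recurrence pop(i) = (i & 1) + pop(i >> 1).

```c
#include <stdint.h>

/* Hypothetical table of pop(i) for i = 0..255, filled by init_pop_table(). */
static unsigned char pop_table[256];

/* Build the table using pop(i) = (low bit of i) + pop(i >> 1). */
void init_pop_table(void) {
    int i;

    pop_table[0] = 0;
    for (i = 1; i < 256; i++)
        pop_table[i] = (unsigned char)((i & 1) + pop_table[i >> 1]);
}

/* Branch-free lookup: one table access per byte of x.
   init_pop_table() must have been called first. */
int pop_lookup(uint32_t x) {
    return pop_table[x & 0xFF] +
           pop_table[(x >> 8) & 0xFF] +
           pop_table[(x >> 16) & 0xFF] +
           pop_table[x >> 24];
}
```

A statically initialized table, as in the listing above, avoids the startup call at the cost of a longer source file.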


Item 167 in [HAK] contains a short algorithm for counting the number 
of 1-bits in a 9-bit quantity that is right-adjusted and isolated in a 
register. It works only on machines with registers of 36 or more bits. 
Below is a version of that algorithm that works on 32-bit machines, 
but only for 8-bit quantities. 


x = x * 0x08040201;    // Make 4 copies. 
x = x >> 3;            // So next step hits proper bits. 
x = x & 0x11111111;    // Every 4th bit. 
x = x * 0x11111111;    // Sum the digits (each 0 or 1). 
x = x >> 28;           // Position the result. 

A version for 7-bit quantities is 

x = x * 0x02040810;    // Make 4 copies, left-adjusted. 
x = x & 0x11111111;    // Every 4th bit. 
x = x * 0x11111111;    // Sum the digits (each 0 or 1). 
x = x >> 28;           // Position the result. 

In these, the last two steps may be replaced with steps to compute the 
remainder of x modulo 15. 


These are not particularly good; most programmers would probably 
prefer to use table lookup. The latter algorithm above, however, has a 
version that uses 64-bit arithmetic, which might be useful for a 64-bit 
machine that has fast multiplication. Its argument is a 15-bit quantity. 
(I don't believe there is a similar algorithm that deals with 16-bit 
quantities, unless it is known that not all 16 bits are 1.) The data type 
long long is a GNU C extension [Stall], meaning twice as long as an 
int, in our case 64 bits. The suffix ULL makes unsigned long 
long constants. 


int pop(unsigned x) { 
   unsigned long long y; 

   y = x * 0x0002000400080010ULL; 
   y = y & 0x1111111111111111ULL; 
   y = y * 0x1111111111111111ULL; 
   y = y >> 60; 
   return y; 
} 


Counting 1-Bits in an Array 


The simplest way to count the number of 1-bits in an array of 
fullwords, in the absence of the population count instruction, is to use 
a procedure such as that of Figure 5-2 on page 66 on each word of the 
array, and simply add the results. 


Another way, which may be faster, is to use the first two executable 
lines of that procedure on groups of three words of the array, adding 
the three partial results. Because each partial result has a maximum 
value of 4 in each 4-bit field, adding three of these partial results gives 
a word with at most 12 in each 4-bit field, so no sum overflows into 
the field to its left. Next, each of these partial results may be 
converted into a word having four 8-bit fields with a maximum value 
of 24 in each field, using 


x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F); 


As these words are produced, they may be added until the maximum 
value is just less than 255; this would allow summing ten such words 
(⌊255/24⌋ = 10). When ten such words have been added, the result 
may be converted into a word having two 16-bit fields, with a 
maximum value of 240 in each, with 


x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF); 


Lastly, 273 such words (⌊65535/240⌋ = 273) can be added together 
until it is necessary to convert the sum to a word consisting of just one 
32-bit field, with 

x = (x & 0x0000FFFF) + (x >> 16); 


In practice, the instructions added for loop control significantly 
detract from what is saved, so it is probably overkill to follow this 
procedure to the extent described. The code of Figure 5-5 applies the 
idea with only one intermediate level. First, it produces words 
containing four 8-bit partial sums. Then, after these are added 
together as much as possible, a fullword sum is produced. The number 
of words of 8-bit fields that can be added with no danger of overflow 
is ⌊255/8⌋ = 31. 

Figure 5-5 Counting 1-bits in an array. 


int pop_array(unsigned A[], int n) { 
   int i, j, lim; 
   unsigned s, s8, x; 

   s = 0; 
   for (i = 0; i < n; i = i + 31) { 
      lim = min(n, i + 31); 
      s8 = 0; 
      for (j = i; j < lim; j++) { 
         x = A[j]; 
         x = x - ((x >> 1) & 0x55555555); 
         x = (x & 0x33333333) + ((x >> 2) & 0x33333333); 
         x = (x + (x >> 4)) & 0x0F0F0F0F; 
         s8 = s8 + x;              // Accumulate four 8-bit partial sums. 
      } 
      x = (s8 & 0x00FF00FF) + ((s8 >> 8) & 0x00FF00FF); 
      x = (x & 0x0000ffff) + (x >> 16); 
      s = s + x; 
   } 
   return s; 
} 


This algorithm was compared to the simple loop method by compiling 
the two procedures with GCC to a target machine that is very similar 
to the basic RISC. The result is 22 instructions per word for the simple 
method, and 17.6 instructions per word for the method of Figure 5-5, 
a savings of 20%. 


Applications 


An application of the population count function is in computing the 
"Hamming distance" between two bit vectors, a concept from the 
theory of error-correcting codes. The Hamming distance is simply the 
number of places where the vectors differ; that is, 


dist(x, y) = pop(x ⊕ y). 


See, for example, the chapter on error-correcting codes in [Dewd]. 
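
A minimal C sketch of the Hamming distance computation follows. It assumes 32-bit words and reuses the population count of Figure 5-2 under the hypothetical name pop32; hamming32 and hamming_vec are likewise names invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Population count in the style of Figure 5-2 (hypothetical name pop32). */
static int pop32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000003Fu);
}

/* Hamming distance between two words: dist(x, y) = pop(x XOR y). */
int hamming32(uint32_t x, uint32_t y) {
    return pop32(x ^ y);
}

/* Hamming distance between two bit vectors stored as arrays of nwords words. */
int hamming_vec(const uint32_t *a, const uint32_t *b, size_t nwords) {
    int d = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        d += pop32(a[i] ^ b[i]);
    return d;
}
```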


Another application is to allow reasonably fast direct-indexed access 
to a moderately sparse array A that is represented in a certain 
compact way. In the compact representation, only the defined, or 


nonzero, elements of the array are stored. There is an auxiliary bit 
string array bits of 32-bit words, which has a 1-bit for each index i for 
which A[i] is defined. As a speedup device, there is also an array of 
words bitsum such that bitsum[j] is the total number of 1-bits in all 
the words of bits that precede entry j. This is illustrated below for an 
array in which elements 0, 2, 32, 47, 48, and 95 are defined. 


bits          bitsum   data 
0x00000005    0        A[0] 
0x00018001    2        A[2] 
0x80000000    5        A[32] 
                       A[47] 
                       A[48] 
                       A[95] 


Given an index i, 0 ≤ i ≤ 95, the corresponding index sparse_i into 
the data array is given by the number of 1-bits in array bits that 


precede the bit corresponding to i. This may be calculated as follows: 


j = i >> 5;                  // j = i/32. 
k = i & 31;                  // k = rem(i, 32); 
mask = 1 << k;               // A "1" at position k. 
if ((bits[j] & mask) == 0) goto no_such_element; 
mask = mask - 1;             // 1's to right of k. 
sparse_i = bitsum[j] + pop(bits[j] & mask); 

The cost of this representation is two bits per element of the full array. 
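
To make the index calculation concrete, here is a sketch that hard-codes the example layout described above (elements 0, 2, 32, 47, 48, and 95 defined). The function name sparse_index, the return value -1 for an undefined element, and the helper name pop32 are assumptions made for this example.

```c
#include <stdint.h>

/* Population count in the style of Figure 5-2 (hypothetical name pop32). */
static int pop32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000003Fu);
}

/* The example from the text: elements 0, 2, 32, 47, 48, and 95 are defined. */
static const uint32_t bits[3]   = {0x00000005u, 0x00018001u, 0x80000000u};
static const uint32_t bitsum[3] = {0, 2, 5};

/* Map a full-array index i (0 <= i <= 95) to an index into the data array.
   Returns -1 (an assumption of this sketch) if A[i] is not defined. */
int sparse_index(unsigned i) {
    unsigned j = i >> 5;               /* word number, i/32 */
    unsigned k = i & 31;               /* bit number within the word */
    uint32_t mask = (uint32_t)1 << k;  /* a 1 at position k */

    if ((bits[j] & mask) == 0)
        return -1;
    mask = mask - 1;                   /* 1's to the right of k */
    return (int)bitsum[j] + pop32(bits[j] & mask);
}
```

With this layout, sparse_index(47) is 3: two defined elements precede word 1, and one more defined element precedes bit 15 within it.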


Still another application of the population function is in computing 
the number of trailing 0's in a word (see "Counting Trailing 0's" on 
page 84). 


5-2 Parity 


The "parity" of a string refers to whether it contains an odd or an even 
number of 1-bits. The string has "odd parity" if it contains an odd 
number of 1-bits; otherwise, it has "even parity." 


Computing the Parity of a Word 


Here we mean to produce a 1 if a word x has odd parity, and a 0 if it 
has even parity. This is the sum, modulo 2, of the bits of x—that is, 
the exclusive or of all the bits of x. 


One way to compute this is to compute pop(x); the parity is the 
rightmost bit of the result. This is fine if you have the population 
count instruction, but if not, there are better ways than using the code 
for pop(x). 


A rather direct method is to compute 


\[ y = \bigoplus_{i=0}^{n-1} (x \gg i), \]

where n is the word size, the shifts are unsigned, and the parity of x 
is given by the rightmost bit of y. (Here ⊕ denotes exclusive or, but 
for this formula, ordinary addition could be used.) 


The parity may be computed much more quickly, for moderately large 
n, as follows (illustrated for n = 32; the shifts can be signed or 
unsigned): 


Equation 3 

y = x ^ (x >> 1); 
y = y ^ (y >> 2); 
y = y ^ (y >> 4); 
y = y ^ (y >> 8); 
y = y ^ (y >>16); 


This executes in ten instructions, as compared to 62 for the first 
method, even if the implied loop is completely unrolled. Again, the 
parity bit is the rightmost bit of y. In fact, with either of these, if the 
shifts are unsigned, then bit i of y gives the parity of the bits of x at 
and to the left of i. Furthermore, because exclusive or is its own 
inverse, y_i ⊕ y_j is the parity of bits i - 1 through j of x, for i ≥ j. 

This is an example of the "parallel prefix," or "scan" operation, which 
has applications in parallel computing [KRS; HS]. Given a sufficient 
number of processors, it can convert certain seemingly serial processes 
from O(n) to O(log n) time. For example, if you have an array of 
words and you wish to compute the exclusive or scan operation on the 
entire array of bits, you can first use (3) on each word of the array, 
and then use essentially the same technique on the array, doing 
exclusive or's on the words of the array. This takes more elementary 
(word length) exclusive or operations than a simple left-to-right 
process, and hence it is not a good idea for a uniprocessor. But on a 
parallel computer with a sufficient number of processors, it can do the 
job in O(log2 n) rather than O(n) time (where n is the number of 
words in the array). 


A direct application of (3) is the conversion of an integer to Gray code 
(see page 236). 


If the code (3) is changed to use left shifts, the parity of the whole 
word x winds up in the leftmost bit position, and bit i of y gives the 
parity of the bits of x at and to the right of position i. 


If rotate shifts are used, the result is a word of all 1's if the parity of x 
is odd, and of all 0's if even. 


The following method executes in nine instructions and computes the 
parity of x as the integer 0 or 1 (the shifts are unsigned). 


x = x ^ (x >> 1); 
x = (x ^ (x >> 2)) & 0x11111111; 
x = x*0x11111111; 
p = (x >> 28) & 1; 


After the second statement above, each hex digit of x is 0 or 1, 
according to the parity of the bits in that hex digit. The multiply adds 
these digits, putting the sum in the high-order hex digit. There can be 
no carry out of any hex column during the add part of the multiply, 


because the maximum sum of a column is 8. 
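
The two word-parity methods can be placed side by side as C functions. This is a sketch assuming 32-bit unsigned words; the names parity_fold and parity_mul are hypothetical.

```c
#include <stdint.h>

/* Parity by shift-and-xor folding, as in (3).  Returns 1 for odd parity. */
int parity_fold(uint32_t x) {
    uint32_t y = x ^ (x >> 1);

    y = y ^ (y >> 2);
    y = y ^ (y >> 4);
    y = y ^ (y >> 8);
    y = y ^ (y >> 16);
    return (int)(y & 1);               /* parity is the rightmost bit */
}

/* Parity by the multiply method: reduce each hex digit to its own parity,
   then let the multiply add the eight digits into the high-order hex digit. */
int parity_mul(uint32_t x) {
    x = x ^ (x >> 1);
    x = (x ^ (x >> 2)) & 0x11111111u;  /* each hex digit is now 0 or 1 */
    x = x * 0x11111111u;               /* column sums are at most 8, so no carries */
    return (int)((x >> 28) & 1);
}
```

Both functions should agree on every input; which is faster depends on the cost of multiplication on the target machine.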


The multiply and shift could be replaced by an instruction to compute 
the remainder after dividing x by 15, giving a (slow) solution in eight 
instructions, if the machine has remainder immediate. 


Adding a Parity Bit to a 7-Bit Quantity 


Item 167 in [HAK] contains a novel expression for putting even parity 
on a 7-bit quantity that is right-adjusted and isolated in a register. By 
this we mean to set the bit to the left of the seven bits, to make an 8- 
bit quantity with even parity. Their code is for a 36-bit machine, but it 
works on a 32-bit machine as well. 


modu((x * 0x10204081) & 0x888888FF, 1920) 


Here, modu(a, b) denotes the remainder of a upon division by b, with 
the arguments and result interpreted as unsigned integers, "*" denotes 
multiplication modulo 2^32, and the constant 1920 is 15·2^7. Actually, this 
computes the sum of the bits of x, and places the sum just to the left of 
the seven bits comprising x. For example, the expression maps 
0x0000007F to 0x000003FF, and 0x00000055 to 0x00000255. 
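
The even-parity expression can be tried out with ordinary C unsigned arithmetic, which on a machine with 32-bit unsigned int wraps modulo 2^32 as the formula requires. The sketch below assumes uint32_t is such a type; the names hakmem_sum7 and even_parity7 are hypothetical.

```c
#include <stdint.h>

/* Evaluate modu((x * 0x10204081) & 0x888888FF, 1920) for a 7-bit x.
   The result is x with the number of 1-bits of x placed just to the
   left of the seven bits. */
uint32_t hakmem_sum7(uint32_t x) {
    uint32_t t = (x * 0x10204081u) & 0x888888FFu;  /* multiply wraps mod 2^32 */

    return t % 1920u;                              /* 1920 = 15 * 2^7 */
}

/* Keep only the low eight bits: the 7-bit value with an even-parity bit. */
uint32_t even_parity7(uint32_t x) {
    return hakmem_sum7(x) & 0xFFu;
}
```

Bit 7 of the sum is the low-order bit of the count of 1-bits, which is why masking to eight bits leaves a quantity with even parity.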


Another ingenious formula from [HAK] is the following, which puts 
odd parity on a 7-bit integer: 


modu((x * 0x00204081) | 0x3DB6DB00, 1152), 


where 1152 = 9·2^7. To understand this, it helps to know that the 
powers of 8 are ±1 modulo 9. If the 0x3DB6DB00 is changed to 
0xBDB6DB00, this formula applies even parity. 


These methods are not practical on today's machines, because memory 


is cheap but division is still slow. Most programmers would compute 
these functions with a simple table lookup. 


Application 


The parity operation is useful in multiplying bit matrices in GF(2) (in 
which the add operation is exclusive or). 


5-3 Counting Leading 0's 


There are several simple ways to count leading 0's with a binary 
search technique. Below is a model that has several variations. It 
executes in 20 to 29 instructions on the basic RISC. The comparisons 
are "logical" (unsigned integers). 


if (x == 0) return(32); 
n = 0; 
if (x <= 0x0000FFFF) {n = n +16; x = x <<16;} 
if (x <= 0x00FFFFFF) {n = n + 8; x = x << 8;} 
if (x <= 0x0FFFFFFF) {n = n + 4; x = x << 4;} 
if (x <= 0x3FFFFFFF) {n = n + 2; x = x << 2;} 
if (x <= 0x7FFFFFFF) {n = n + 1;} 
return n; 


One variation is to replace the comparisons with and's: 


if ((x & 0xFFFF0000) == 0) {n = n +16; x = x <<16;} 
if ((x & 0xFF000000) == 0) {n = n + 8; x = x << 8;} 


Another variation, which avoids large immediate values, is to use shift 
right instructions. 


The last if statement is simply adding 1 to n if the high-order bit of 
x is 0, so an alternative, which saves a branch instruction, is: 


n = n + 1 - (x >> 31); 


The "+ 1" in this assignment can be omitted if n is initialized to 1 
rather than to 0. These observations lead to the algorithm (12 to 20 
instructions on the basic RISC) shown in Figure 5-6. A further 
improvement is possible for the case in which x begins with a 1-bit: 
change the first line to 


if ((int)x <= 0) return (~x >> 26) & 32; 


Figure 5-6 Number of leading zeros, binary search. 


int nlz(unsigned x) { 


int n; 

if (x == 0) return(32); 

n= 1; 

if ((x >> 16) == 0) {n =n +16; x = x <<16;} 
if ((x >> 24) == 0) {n =n + 8; x =x << 8;} 
if ((x >> 28) == 0) {n= n + 4; x =x << 4;} 
if ((x >> 30) == 0) {n =n + 2; x =x << 2;} 
n=n - (x >> 31); 


return n; 


} 
Figure 5-7 illustrates a sort of reversal of the above. It requires fewer 
operations the more leading 0's there are, and avoids large immediate 


values and large shift amounts. It executes in 12 to 20 instructions on 
the basic RISC. 


Figure 5-7 Number of leading zeros, binary search, counting down. 


int nlz(unsigned x) { 
   unsigned y; 
   int n; 

   n = 32; 
   y = x >>16; if (y != 0) {n = n -16; x = y;} 
   y = x >> 8; if (y != 0) {n = n - 8; x = y;} 
   y = x >> 4; if (y != 0) {n = n - 4; x = y;} 
   y = x >> 2; if (y != 0) {n = n - 2; x = y;} 
   y = x >> 1; if (y != 0) return n - 2; 
   return n - x; 
} 


This algorithm is amenable to a "table assist": the last four executable 
lines can be replaced by 


static char table[256] = {0,1,2,2,3,3,3,3,4,4,...,8}; 
return n - table[x]; 


Many algorithms can be aided by table lookup, but this will not often 
be mentioned here. 
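
For illustration, the table-assisted variant might look like the sketch below. The table holds, for each byte value, the number of bits needed to represent it (0, 1, 2, 2, 3, 3, 3, 3, 4, ...), and is filled at startup rather than written out. The names len_table, init_len_table, and nlz_table are hypothetical.

```c
#include <stdint.h>

/* Hypothetical table: len_table[v] = position of the highest 1-bit of v,
   counting from 1, or 0 if v is 0.  Filled by init_len_table(). */
static unsigned char len_table[256];

void init_len_table(void) {
    int i;

    len_table[0] = 0;
    for (i = 1; i < 256; i++)
        len_table[i] = (unsigned char)(1 + len_table[i >> 1]);
}

/* Binary search down to one byte, then finish with a table lookup.
   init_len_table() must have been called first. */
int nlz_table(uint32_t x) {
    uint32_t y;
    int n = 32;

    y = x >> 16; if (y != 0) {n = n - 16; x = y;}
    y = x >> 8;  if (y != 0) {n = n - 8;  x = y;}
    return n - len_table[x];           /* x is now in the range 0..255 */
}
```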


For compactness, this and the preceding algorithms in this section can 
be coded as loops. For example, the algorithm of Figure 5-7 becomes 


the algorithm shown in Figure 5-8. This executes in 23 to 33 basic 
RISC instructions, ten of which are conditional branches. 


Figure 5-8 Number of leading zeros, binary search, coded as a loop. 


int nlz(unsigned x) { 
   unsigned y; 
   int n, c; 

   n = 32; 
   c = 16; 
   do { 
      y = x >> c; if (y != 0) {n = n - c; x = y;} 
      c = c >> 1; 
   } while (c != 0); 
   return n - x; 
} 


One can, of course, simply shift left one place at a time, counting, 
until the sign bit is on; or shift right one place at a time until the word 
is all 0. These algorithms are compact and work well if the number of 
leading 0's is expected to be small or large, respectively. One can 
combine the methods, as shown in Figure 5-9. We mention this 
because the technique of merging two algorithms and choosing the 
result of whichever one stops first is more generally applicable. It 
leads to code that runs fast on superscalar and VLIW machines, 
because of the proximity of independent instructions. (These machines 
can execute two or more instructions simultaneously, provided they 
are independent.) 


Figure 5-9 Number of leading zeros, working both ends at the same time. 


int nlz(int x) { 
   int y, n; 

   n = 0; 
   y = x; 
L: if (x < 0) return n; 
   if (y == 0) return 32 - n; 
   n = n + 1; 
   x = x << 1; 
   y = y >> 1; 
   goto L; 
} 


On the basic RISC, this executes in min(3 + 6nlz(x), 5 + 6(32 - 
nlz(x))) instructions, or 99 worst case. However, one can imagine a 
superscalar or VLIW machine executing the entire loop body in one 
cycle if the comparison results are obtained as a by-product of the 


shifts, or in two cycles otherwise, plus the branch overhead. 


It is straightforward to convert either of the algorithms of Figure 5-6 
or Figure 5-7 to a branch-free counterpart. Figure 5-10 shows a 
version that does the job in 28 basic RISC instructions. 


Figure 5-10 Number of leading zeros, branch-free binary search. 


int nlz(unsigned x) { 
   int y, m, n; 

   y = -(x >> 16);      // If left half of x is 0, 
   m = (y >> 16) & 16;  // set n = 16.  If left half 
   n = 16 - m;          // is nonzero, set n = 0 and 
   x = x >> m;          // shift x right 16. 
                        // Now x is of the form 0000xxxx. 
   y = x - 0x100;       // If positions 8-15 are 0, 
   m = (y >> 16) & 8;   // add 8 to n and shift x left 8. 
   n = n + m; 
   x = x << m; 

   y = x - 0x1000;      // If positions 12-15 are 0, 
   m = (y >> 16) & 4;   // add 4 to n and shift x left 4. 
   n = n + m; 
   x = x << m; 

   y = x - 0x4000;      // If positions 14-15 are 0, 
   m = (y >> 16) & 2;   // add 2 to n and shift x left 2. 
   n = n + m; 
   x = x << m; 

   y = x >> 14;         // Set y = 0, 1, 2, or 3. 
   m = y & ~(y >> 1);   // Set m = 0, 1, 2, or 2 resp. 
   return n + 2 - m; 
} 


If your machine has the population count instruction, a good way to 
compute the number of leading zeros function is given in Figure 5-11. 
The five assignments to x may be reversed, or in fact done in any 
order. This is branch-free and takes 11 instructions. Even if population 
count is not available, this algorithm may be useful. Using the 21- 
instruction code for counting 1-bits given in Figure 5-2 on page 66, it 
executes in 32 branch-free basic RISC instructions. 


Figure 5-11 Number of leading zeros, right-propagate and count 1-bits. 


int nlz(unsigned x) { 
   int pop(unsigned x); 

   x = x | (x >> 1); 
   x = x | (x >> 2); 
   x = x | (x >> 4); 
   x = x | (x >> 8); 
   x = x | (x >>16); 
   return pop(~x); 
} 


Floating-Point Methods 


The floating-point post-normalization facilities can be used to count 
leading zeros. It works out quite well with IEEE-format floating-point 
numbers. The idea is to convert the given unsigned integer to double- 
precision floating-point, extract the exponent, and subtract it from a 
constant. Figure 5-12 illustrates a complete procedure for this. 


Figure 5-12 Number of leading zeros, using IEEE floating-point. 

int nlz(unsigned k) { 
   union { 
      unsigned asInt[2]; 
      double asDouble; 
   }; 
   int n; 

   asDouble = (double)k + 0.5; 
   n = 1054 - (asInt[LE] >> 20); 
   return n; 
} 


The code uses the C++ "anonymous union" to overlay an integer 
with a double-precision floating-point quantity. Variable LE must be 
1 for execution on a little-endian machine, and 0 for big-endian. The 
addition of 0.5, or some other small number, is necessary for the 
method to work when k = 0. 


We will not attempt to assess the execution time of this code, because 
machines differ so much in their floating-point capabilities. For 
example, many machines have their floating-point registers separate 
from the integer registers, and on such machines data transfers 
through memory may be required to convert an integer to floating- 
point and then move the result to an integer register. 


The code of Figure 5-12 is not valid C or C++ according to the ANSI 
standard, because it refers to the same memory locations as two 
different types. Thus, one cannot be sure it will work on a particular 


machine and compiler. It does work with IBM's XLC compiler on AIX, 
and with the GCC compiler on AIX and on Windows 2000, at all 
optimization levels (as of this writing, anyway). If the code is altered 
to do the overlay defining with something like 


xx = (double)k + 0.5; 
n = 1054 - (*((unsigned *)&xx + LE) >> 20); 


it does not work on these systems with optimization turned on. This 
code, incidentally, violates a second ANSI standard, namely, that 
pointer arithmetic can be performed only on pointers to array 
elements [Cohen]. The failure, however, is due to the first violation, 
involving overlay defining. 
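
One way to sidestep the overlay problem in C, offered here only as a sketch and not as the book's method, is to copy the bytes of the double into a 64-bit integer with memcpy. This assumes that double is a 64-bit IEEE 754 value; the name nlz_double is hypothetical. Because the whole 64-bit pattern is shifted, no endianness variable such as LE is needed.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: number of leading zeros via the exponent of an IEEE double.
   Assumes double is the 64-bit IEEE 754 format. */
int nlz_double(uint32_t k) {
    double d = (double)k + 0.5;        /* the 0.5 makes k = 0 work */
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);    /* copy the bit pattern without aliasing */
    return 1054 - (int)(bits >> 52);   /* sign is 0, so this is the biased exponent */
}
```

Compilers commonly reduce such a fixed-size memcpy to a register move, but that is an optimization detail and not something this sketch depends on.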


In spite of the flakiness of this code, [1] three variations are given 
below. 


[1] The flakiness is due to the way C is used. The methods illustrated 
would be perfectly acceptable if coded in machine language, or 
generated by a compiler, for a particular machine. 


asDouble = (double)k; 
n = 1054 - (asInt[LE] >> 20); 
n = (n & 31) + (n >> 9); 

k = k & ~(k >> 1); 
asFloat = (float)k + 0.5f; 
n = 158 - (asInt >> 23); 

k = k & ~(k >> 1); 
asFloat = (float)k; 
n = 158 - (asInt >> 23); 
n = (n & 31) + (n >> 6); 


In the first variation, the problem with k = 0 is fixed not by a 


floating-point addition of 0.5, but by integer arithmetic on the result 
n (which would be 1054, or 0x41E, if the correction were not done). 


The next two variations use single-precision floating-point, with the 
"anonymous union" changed in an obvious way. Here there is a new 
problem: Rounding can throw off the result when the rounding mode 
is either round to nearest (almost universally used) or round toward 
+∞. For round to nearest mode, the rounding problem occurs for k 
in the ranges hexadecimal FFFFFF80 to FFFFFFFF, 7FFFFFC0 to 
7FFFFFFF, 3FFFFFE0 to 3FFFFFFF, and so on. In rounding, an add of 1 
carries all the way to the left, changing the position of the most 
significant 1-bit. The correction steps used above clear the bit to the 
right of the most significant 1-bit, blocking the carry. 


The GNU C/C++ compiler has a unique feature that allows coding 
any of these schemes as a macro, giving in-line code for the function 
references [Stall]. This feature allows statements, including 
declarations, to be inserted in code where an expression is called for. 
The sequence of statements would usually end with an expression, 
which is taken to be the value of the construction. Such a macro 
definition is shown below, for the first single-precision variation. (In 
C, it is customary to use uppercase for macro names.) 


#define NLZ(k) \ 
   ({union {unsigned _asInt; float _asFloat;}; \ 
     unsigned _kk = (k) & ~((unsigned)(k) >> 1); \ 
     _asFloat = (float)_kk + 0.5f; \ 
     158 - (_asInt >> 23);}) 


The underscores are used to avoid name conflicts with parameter k; 
presumably, user-defined names do not begin with underscores. 


Relation to the Log Function 


The "nlz" function is, essentially, the "integer log base 2" function. For 
unsigned x ≠ 0, 

\[ \lfloor \log_2(x) \rfloor = 31 - \mathrm{nlz}(x), \quad \text{and} \quad \lceil \log_2(x) \rceil = 32 - \mathrm{nlz}(x - 1). \]


See also Section 11-4, "Integer Logarithm," on page 215. 
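
The two identities translate directly into C. The sketch below uses a binary-search nlz in the style of Figure 5-6 under the hypothetical name nlz32; floor_log2 and ceil_log2 are also names chosen for this example, and both require x ≠ 0.

```c
#include <stdint.h>

/* Number of leading zeros, binary search, in the style of Figure 5-6. */
static int nlz32(uint32_t x) {
    int n;

    if (x == 0) return 32;
    n = 1;
    if ((x >> 16) == 0) {n = n + 16; x = x << 16;}
    if ((x >> 24) == 0) {n = n + 8;  x = x << 8;}
    if ((x >> 28) == 0) {n = n + 4;  x = x << 4;}
    if ((x >> 30) == 0) {n = n + 2;  x = x << 2;}
    n = n - (int)(x >> 31);
    return n;
}

/* floor(log2(x)) for x != 0. */
int floor_log2(uint32_t x) {
    return 31 - nlz32(x);
}

/* ceil(log2(x)) for x != 0; note that nlz32(0) = 32 handles x = 1. */
int ceil_log2(uint32_t x) {
    return 32 - nlz32(x - 1);
}
```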


Another closely related function is bitsize, the number of bits required 
to represent its argument as a signed quantity in two's-complement 
form. We take its definition to be 


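\[ \mathrm{bitsize}(x) = \begin{cases}
1, & x = -1 \text{ or } 0 \\
2, & x = -2 \text{ or } 1 \\
3, & -4 \le x \le -3 \text{ or } 2 \le x \le 3 \\
4, & -8 \le x \le -5 \text{ or } 4 \le x \le 7 \\
\vdots \\
32, & -2^{31} \le x \le -2^{30} - 1 \text{ or } 2^{30} \le x \le 2^{31} - 1.
\end{cases} \]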


From this definition, bitsize(x) = bitsize(- x - 1). But - x - 1 = ~x, so 
an algorithm for bitsize is (where the shift is signed) 


x = x ^ (x >> 31);        // If (x < 0) x = -x - 1; 
return 33 - nlz(x); 

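
A self-contained sketch of bitsize follows. Because the result of right-shifting a negative signed value is implementation-defined in C, this version replaces the signed shift with an explicit test and a complement; that substitution, and the names bitsize32 and nlz32, are assumptions of this example.

```c
#include <stdint.h>

/* Number of leading zeros, binary search, in the style of Figure 5-6. */
static int nlz32(uint32_t x) {
    int n;

    if (x == 0) return 32;
    n = 1;
    if ((x >> 16) == 0) {n = n + 16; x = x << 16;}
    if ((x >> 24) == 0) {n = n + 8;  x = x << 8;}
    if ((x >> 28) == 0) {n = n + 4;  x = x << 4;}
    if ((x >> 30) == 0) {n = n + 2;  x = x << 2;}
    n = n - (int)(x >> 31);
    return n;
}

/* Bits needed to hold x as a two's-complement signed quantity. */
int bitsize32(int32_t x) {
    /* For negative x use ~x, which equals -x - 1 and has the same bitsize. */
    uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;

    return 33 - nlz32(u);
}
```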
Applications 


Two important applications of the number of leading zeros function 
are in simulating floating-point arithmetic operations and in various 
division algorithms (see Figure 9-1 on page 141, and Figure 9-3 on 

page 152). The instruction seems to have a miscellany of other uses. 


It can be used to get the "x = y " predicate in only three instructions 
(see "Comparison Predicates" on page 21), and as an aid in computing 
certain elementary functions (see pages 205, 208, 214, and 218). 


A novel application is to generate exponentially distributed random 
integers by generating uniformly distributed random integers and 
taking "nlz" of the result [GLS1]. The result is 0 with probability 1/2, 
1 with probability 1/4, 2 with probability 1/8, and so on. Another 
application is as an aid in searching a word for a consecutive string of 
1-bits (or 0-bits) of a certain length, a process that is used in some 
disk block allocation algorithms. For these last two applications, the 
number of trailing zeros function could also be used. 


5-4 Counting Trailing 0's 


If the number of leading zeros instruction is available, then the best 
way to count trailing 0's is, most likely, to convert it to a count 
leading 0's problem: 


32 - nlz(¬x & (x - 1)). 


If population count is available, a slightly better method is to form a 
mask that identifies the trailing 0's, and count the 1-bits in it [Hay2], 
such as 


pop(¬x & (x - 1)), and 
32 - pop(x | -x). 


Variations exist using other expressions for forming a mask that 
identifies the trailing zeros of x, such as those given in Section 
2-1,"Manipulating Rightmost Bits" on page 11. These methods are also 
reasonable even if the machine has none of the bit-counting 
instructions. Using the algorithm for pop(x) given in Figure 5-2 on 
page 66, the first expression above executes in about 3 + 21 = 24 
instructions (branch-free). 
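
Both mask-and-count expressions can be written out in C as follows. This is a sketch for 32-bit words that reuses the population count of Figure 5-2 under the hypothetical name pop32; ntz_pop and ntz_pop2 are names invented for this example.

```c
#include <stdint.h>

/* Population count in the style of Figure 5-2 (hypothetical name pop32). */
static int pop32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return (int)(x & 0x0000003Fu);
}

/* ~x & (x - 1) has 1's exactly at the trailing 0 positions of x. */
int ntz_pop(uint32_t x) {
    return pop32(~x & (x - 1));
}

/* x | -x has 1's at the rightmost 1-bit of x and everywhere to its left. */
int ntz_pop2(uint32_t x) {
    return 32 - pop32(x | (0u - x));
}
```

Both forms return 32 for x = 0, with no special case needed.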


Figure 5-13 shows an algorithm that does it directly, in 12 to 20 basic 
RISC instructions (for x ≠ 0). 


Figure 5-13 Number of trailing zeros, binary search. 


int ntz(unsigned x) { 
   int n; 

   if (x == 0) return(32); 
   n = 1; 
   if ((x & 0x0000FFFF) == 0) {n = n +16; x = x >>16;} 
   if ((x & 0x000000FF) == 0) {n = n + 8; x = x >> 8;} 
   if ((x & 0x0000000F) == 0) {n = n + 4; x = x >> 4;} 
   if ((x & 0x00000003) == 0) {n = n + 2; x = x >> 2;} 
   return n - (x & 1); 
} 


The n + 16 can be simplified to 17 if that helps, and if the compiler 
is not smart enough to do that for you (this does not affect the number 
of instructions as we are counting them). 


Figure 5-14 shows a variation that uses smaller immediate values and 
simpler operations. It executes in 12 to 21 basic RISC instructions. 
Unlike the above procedure, when the number of trailing 0's is small, 
the one in Figure 5-14 executes a larger number of instructions, but 
also a larger number of "fall through" branches. 


Figure 5-14 Number of trailing zeros, smaller immediate values. 


int ntz(unsigned x) { 
   unsigned y; 
   int n; 

   if (x == 0) return 32; 
   n = 31; 
   y = x <<16;  if (y != 0) {n = n -16;  x = y;} 
   y = x << 8;  if (y != 0) {n = n - 8;  x = y;} 
   y = x << 4;  if (y != 0) {n = n - 4;  x = y;} 
   y = x << 2;  if (y != 0) {n = n - 2;  x = y;} 
   y = x << 1;  if (y != 0) {n = n - 1;} 
   return n; 
} 



The line just above the return statement may alternatively be coded 

   n = n - ((x << 1) >> 31); 

which saves a branch, but not an instruction. 


In terms of number of instructions executed, it is hard to beat the 
"search tree" [Aus2]. Figure 5-15 illustrates this procedure for an 8-bit 
argument. This procedure executes in seven instructions for all paths 
except the last two (return 7 or 8), which require nine. A 32-bit 
version would execute in 11 to 13 instructions. Unfortunately, for 
large word sizes the program is quite large. The 8-bit version above is 
12 lines of executable source code and would compile into about 41 
instructions. A 32-bit version would be 48 lines and about 164 
instructions, and a 64-bit version would be twice that. 


Figure 5-15 Number of trailing zeros, binary search tree. 


int ntz(char x) { 
   if (x & 15) { 
      if (x & 3) { 
         if (x & 1) return 0; 
         else return 1; 
      } 
      else if (x & 4) return 2; 
      else return 3; 
   } 
   else if (x & 0x30) { 
      if (x & 0x10) return 4; 
      else return 5; 
   } 
   else if (x & 0x40) return 6; 
   else if (x) return 7; 
   else return 8; 
} 

If the number of trailing 0's is expected to be small or large, then the 
simple loops below are quite fast. The left-hand algorithm executes in 
5 + 3ntz(x) and the right one in 3 + 3(32 - ntz(x)) basic RISC 
instructions. 


Figure 5-16 Number of trailing zeros, simple counting loops. 


int ntz(unsigned x) {            int ntz(unsigned x) { 
   int n;                           int n; 

   x = ~x & (x - 1); 
   n = 0;                           n = 32; 
   while (x != 0) {                 while (x != 0) { 
      n = n + 1;                       n = n - 1; 
      x = x >> 1;                      x = x + x; 
   }                                } 
   return n;                        return n; 
}                                } 


It is interesting to note that if the numbers x are uniformly distributed, 
then the average number of trailing 0's is, very nearly, 1.0. To see this, 
sum the products p_i n_i, where p_i is the probability that there are 
exactly n_i trailing 0's. That is, 


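\[
S = \frac{1}{2}\cdot 0 + \frac{1}{4}\cdot 1 + \frac{1}{8}\cdot 2
  + \frac{1}{16}\cdot 3 + \frac{1}{32}\cdot 4 + \frac{1}{64}\cdot 5 + \cdots
  = \sum_{n=0}^{\infty} \frac{n}{2^{n+1}}.
\]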


To evaluate this sum, consider the following array: 


1/4  1/8  1/16  1/32  1/64  ... 
     1/8  1/16  1/32  1/64  ... 
          1/16  1/32  1/64  ... 
                1/32  1/64  ... 
                      1/64  ... 
                            ... 


The sum of each column is a term of the series for S. Hence S is the 
sum of all the numbers in the array. The sums of the rows are 


1/4 + 1/8 + 1/16 + 1/32 + ... = 1/2 
1/8 + 1/16 + 1/32 + 1/64 + ... = 1/4 
1/16 + 1/32 + 1/64 + 1/128 + ... = 1/8 
... 


and the sum of these is 1/2 + 1/4 + 1/8 + ... = 1. The absolute 
convergence of the original series justifies the rearrangement. 
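The result can be checked numerically. The following sketch (the function name ntz_sum16 is hypothetical) totals the trailing 0's of every nonzero 16-bit value. Dividing the total by 65535, the number of values examined, gives an average of about 0.9998.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of ntz(x) over all nonzero 16-bit values x, by a simple loop. */
static unsigned long ntz_sum16(void) {
    unsigned long sum = 0;
    for (uint32_t x = 1; x <= 0xFFFFu; x++) {
        uint32_t t = x;
        while ((t & 1u) == 0) {   /* count the trailing 0's of x */
            sum++;
            t >>= 1;
        }
    }
    return sum;                   /* 65519 */
}
```

Zero is excluded here because its count (16 for a 16-bit word) depends on the word size rather than on the series.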


Sometimes, a function similar to ntz(x) is wanted, but a 0 argument is 
a special case, perhaps an error, that should be identified with a value 
of the function that's easily distinguished from the "normal" values of 
the function. For example, let us define "the number of factors of 2 in 
x" to be 


\[
\mathrm{fact2}(x) =
\begin{cases}
\mathrm{ntz}(x), & x \ne 0, \\
-1, & x = 0.
\end{cases}
\]


This can be calculated from 


31 - nlz(x & -x). 
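A small sketch of this formula in C follows. It assumes a 32-bit word and uses a plain loop as a stand-in for the number of leading zeros instruction; the helper nlz shown here is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the "nlz" instruction (32-bit word assumed). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

/* Number of factors of 2 in x, with -1 for x = 0. */
static int fact2(uint32_t x) {
    return 31 - nlz(x & (0u - x));   /* x & -x isolates the rightmost 1-bit */
}
```

For x = 0 the isolated bit is absent, nlz returns 32, and the result is -1 with no special test.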


Applications 


[GLS1] points out some interesting applications of the number of 
trailing zeros function. It has been named the "ruler function" by Eric 
Jensen, because it gives the height of a tick mark on a ruler that's 
divided into halves, quarters, eighths, and so on. 


It has an application in R. W. Gosper's loop-detection algorithm, which 
will now be described in some detail because it is quite elegant and it 
does more than might at first seem possible. 


Suppose a sequence X_0, X_1, X_2, ... is defined by X_{n+1} = f(X_n). If the 
range of f is finite, the sequence is necessarily periodic. That is, it 
consists of a leader X_0, X_1, ..., X_{μ-1} followed by a cycle X_μ, X_{μ+1}, 
..., X_{μ+λ-1} that repeats without limit (X_μ = X_{μ+λ}, X_{μ+1} = X_{μ+λ+1}, 
and so on, where λ is the period of the cycle). Given the function f, the 
loop-detection problem is to find the index μ of the first element that 
repeats, and the period λ. Loop detection has applications in testing 
random number generators and detecting a cycle in a linked list. 


One could save all the values of the sequence as they are produced, 
and compare each new element with all the preceding ones. This 
would immediately show where the second cycle starts. But 
algorithms exist that are much more efficient in space and time. 


Perhaps the simplest is due to R. W. Floyd [Knu2, sec. 3.1, prob. 6]. 
This algorithm iterates the process 


   x = f(x) 
   y = f(f(y)) 


with x and y initialized to X_0. After the nth step, x = X_n and y = X_{2n}. 
These are compared, and if equal, it is known that X_n and X_{2n} are 
separated by an integral multiple of the period λ; that is, 2n - n = n is 
a multiple of λ. Then μ, the index of the first element of the cycle, can 
be determined by regenerating the sequence from the beginning, comparing 
X_0 to X_n, then X_1 to X_{n+1}, and so on. Equality occurs when X_μ is 
compared to X_{n+μ}. Finally, λ can be determined by regenerating more 
elements, comparing X_μ to X_{μ+1}, X_{μ+2}, .... This algorithm requires 
only a small and bounded amount of space, but it evaluates f many times. 
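The three phases just described might be coded as in the following sketch. The names ld_floyd and example_f are hypothetical, and example_f is only a small test function chosen so that the sequence starting at 3 has a leader of length 2 and a cycle of period 6.

```c
#include <assert.h>

/* Hypothetical test function: from 3 the sequence runs
   3, 10, 101, 2, 5, 26, 167, 95, 101, ... */
static unsigned example_f(unsigned x) { return (x * x + 1u) % 255u; }

/* Floyd's loop detection: returns the leader length in *mu and the
   period in *lambda for the sequence X0, f(X0), f(f(X0)), ... */
static void ld_floyd(unsigned (*f)(unsigned), unsigned x0,
                     unsigned *mu, unsigned *lambda) {
    unsigned x = x0, y = x0, m = 0, l = 1;

    do {                           /* Phase 1: x = X_n, y = X_2n. */
        x = f(x);
        y = f(f(y));
    } while (x != y);              /* Now n is a multiple of the period. */

    y = x0;                        /* Phase 2: compare X_i with X_(n+i). */
    while (y != x) { y = f(y); x = f(x); m++; }

    x = f(y);                      /* Phase 3: walk once around the cycle. */
    while (x != y) { x = f(x); l++; }

    *mu = m;
    *lambda = l;
}
```

Called as ld_floyd(example_f, 3, &mu, &lambda), this sets mu to 2 and lambda to 6. The function f is evaluated repeatedly over the same elements, which is the cost noted above.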


Gosper's algorithm [HAK, item 132; Knu2, Answers to Exercises for 
sec. 3.1, prob. 7] finds the period λ, but not the starting point μ of the 
first cycle. Its main feature is that it never backs up to reevaluate f, 
and it is quite economical in space and time. It is not bounded in 
space; it requires a table of size log2(L) + 1, where L is the largest 
possible period. This is not a lot of space; for example, if it is known a 
priori that L ≤ 2^32, then 33 words suffice. 


Gosper's algorithm, coded in C, is shown in Figure 5-17. This C 
function is given the function f being analyzed and a starting value X_0. 
It returns lower and upper bounds on μ, the index at which the cycle 
starts, and the period λ. (Although Gosper's algorithm cannot compute μ, 
it can compute lower and upper bounds μ_l and μ_u such that 
μ_u - μ_l + 1 ≤ max(λ - 1, 1).) The algorithm works by comparing X_n, for 
n = 1, 2, ..., to a subset of size ⌊log2 n⌋ + 1 of the elements of the 
sequence that precede X_n. The elements of the subset are the closest 
preceding X_i such that i + 1 ends in a 1-bit (that is, i is the even 
number preceding n), the closest preceding X_i such that i + 1 ends in 
exactly one 0-bit, the closest preceding X_i such that i + 1 ends in 
exactly two 0-bits, and so on. 


Figure 5-17 Gosper's loop-detection algorithm. 


void ld_Gosper(int (*f)(int), int X0, int *mu_l, 
                              int *mu_u, int *lambda) { 
   int Xn, k, m, kmax, n, lgl; 
   int T[33]; 

   T[0] = X0; 
   Xn = X0; 
   for (n = 1; ; n++) { 
      Xn = f(Xn); 
      kmax = 31 - nlz(n);           // Floor(log2 n). 
      for (k = 0; k <= kmax; k++) { 
         if (Xn == T[k]) goto L; 
      } 
      T[ntz(n+1)] = Xn;             // No match. 
   } 
L: 
   // Compute m = max{i | i < n and ntz(i+1) = k}. 

   m = ((((n >> k) - 1) | 1) << k) - 1; 
   *lambda = n - m; 
   lgl = 31 - nlz(*lambda - 1);     // Ceil(log2 lambda) - 1. 
   *mu_u = m;                       // Upper bound on mu. 
   *mu_l = m - max(1, 1 << lgl) + 1;// Lower bound on mu. 
} 


Thus, the comparisons proceed as follows: 


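X_1: X_0                X_7:  X_6, X_5, X_3            X_13: X_12, X_9, X_11, X_7 
X_2: X_0, X_1           X_8:  X_6, X_5, X_3, X_7       X_14: X_12, X_13, X_11, X_7 
X_3: X_2, X_1           X_9:  X_8, X_5, X_3, X_7       X_15: X_14, X_13, X_11, X_7 
X_4: X_2, X_1, X_3      X_10: X_8, X_9, X_3, X_7       X_16: X_14, X_13, X_11, X_7, X_15 
X_5: X_4, X_1, X_3      X_11: X_10, X_9, X_3, X_7      X_17: X_16, X_13, X_11, X_7, X_15 
X_6: X_4, X_5, X_3      X_12: X_10, X_9, X_11, X_7     X_18: X_16, X_17, X_11, X_7, X_15 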


It can be shown that the algorithm always terminates with n 
somewhere in the second cycle, that is, with n < μ + 2λ, where μ is the 
index at which the cycle starts and λ is its period. See [Knu2] 
for further details. 
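For experimentation, here is a self-contained sketch in the spirit of Figure 5-17. It is not the figure's code verbatim: it assumes a 32-bit word, uses unsigned sequence values, substitutes simple loop-based helpers for nlz and ntz, and replaces max(1, 1 << lgl) with an explicit test so that no shift by a negative amount occurs when the period is 1. The names ld_gosper and example_f are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static int nlz(uint32_t x) {              /* stand-in for the instruction */
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

static int ntz(uint32_t x) { return 32 - nlz(~x & (x - 1u)); }

/* Hypothetical test function: from 3 the leader has length 2, period 6. */
static unsigned example_f(unsigned x) { return (x * x + 1u) % 255u; }

static void ld_gosper(unsigned (*f)(unsigned), unsigned x0,
                      int *mu_l, int *mu_u, int *lambda) {
    unsigned T[33], xn = x0;
    int k, kmax, n, m, lgl, span;

    T[0] = x0;
    for (n = 1; ; n++) {
        xn = f(xn);
        kmax = 31 - nlz((uint32_t)n);          /* floor(log2 n) */
        for (k = 0; k <= kmax; k++)
            if (xn == T[k]) goto found;
        T[ntz((uint32_t)n + 1u)] = xn;         /* no match: save X_n */
    }
found:
    m = ((((n >> k) - 1) | 1) << k) - 1;       /* index of the matched element */
    *lambda = n - m;
    lgl = 31 - nlz((uint32_t)(*lambda - 1));   /* ceil(log2 lambda) - 1 */
    span = (lgl < 0) ? 1 : (1 << lgl);         /* max(1, 2**lgl) */
    *mu_u = m;
    *mu_l = m - span + 1;
}
```

With example_f and a starting value of 3, the match occurs at n = 9 against the saved X_3, giving a period of 6 and the bounds 0 and 3, which bracket the true start of the cycle at index 2.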


The ruler function reveals how to solve the Tower of Hanoi puzzle. 
Number the n disks from 0 to n - 1. At each move k, as k goes from 1 
to 2^n - 1, move disk ntz(k) the minimum permitted distance to the 
right, in a circular manner. 
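The rule can be tried out with the short simulation below, offered as a sketch only. Pegs are numbered 0 to 2 with "right" meaning peg + 1 modulo 3, disk 0 is the smallest, and the names hanoi_ruler and peg_accepts are hypothetical. The function checks that every move is legal and returns the peg on which the whole tower ends up, or -1 if anything goes wrong.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DISKS 16

static int ntz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) { n++; x >>= 1; }
    return n;
}

/* 1 if no disk smaller than d is on peg p, so d may sit on top there. */
static int peg_accepts(const int peg_of[], int d, int p) {
    for (int i = 0; i < d; i++)
        if (peg_of[i] == p) return 0;
    return 1;
}

static int hanoi_ruler(int n) {
    int peg_of[MAX_DISKS] = {0};            /* all disks start on peg 0 */

    assert(n >= 1 && n <= MAX_DISKS);
    for (uint32_t k = 1; k < (1u << n); k++) {
        int d = ntz(k);                     /* disk to move at step k */
        int from = peg_of[d];
        int to = (from + 1) % 3;            /* try one peg to the right */

        if (!peg_accepts(peg_of, d, from)) return -1;  /* d must be on top */
        if (!peg_accepts(peg_of, d, to)) to = (from + 2) % 3;
        if (!peg_accepts(peg_of, d, to)) return -1;
        peg_of[d] = to;
    }
    for (int i = 1; i < n; i++)
        if (peg_of[i] != peg_of[0]) return -1;
    return peg_of[0];
}
```

The destination peg depends on n: the tower lands on peg 1 for n = 1 and n = 3, and on peg 2 for n = 2.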


The ruler function can be used to generate a reflected binary Gray 
code (see Section 13-1 on page 235). Start with an arbitrary n-bit 
word, and at each step k, as k goes from 1 to 2^n - 1, flip bit ntz(k). 
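A brief sketch of this construction follows. The function name gray_ruler is illustrative, and the caller is assumed to supply an array with room for 2^n words.

```c
#include <assert.h>
#include <stdint.h>

static int ntz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) { n++; x >>= 1; }
    return n;
}

/* Fills code[0 .. 2**n - 1] with a Gray code sequence beginning at
   'start', flipping bit ntz(k) at step k.  Assumes 1 <= n <= 31. */
static void gray_ruler(int n, uint32_t start, uint32_t code[]) {
    uint32_t g = start;

    code[0] = g;
    for (uint32_t k = 1; k < (1u << n); k++) {
        g ^= 1u << ntz(k);
        code[k] = g;
    }
}
```

When the starting word is 0, entry k of the array equals k ^ (k >> 1), the usual closed form for the reflected binary Gray code.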


Chapter 6. Searching Words 


Find First 0-Byte 


Find First String of 1-Bits of a Given Length 


6-1 Find First 0-Byte 


The need for this function stems mainly from the way character 
strings are represented in the C language. They have no explicit length 
stored with them; instead, the end of the string is denoted by an all-0 
byte. To find the length of a string, a C program uses the "strlen" 
(string length) function. This function searches the string, from left to 
right, for the 0-byte, and returns the number of bytes scanned, not 
counting the 0-byte. 


A fast implementation of "strlen" might load and test single bytes until 
a word boundary is reached, and then load a word at a time into a 
register, and test the register for the presence of a 0-byte. On big- 
endian machines, we want a function that returns the index of the first 
0-byte from the left. A convenient encoding is values from 0 to 3 
denoting bytes 0 to 3, and a value of 4 denoting that there is no 0-byte 
in the word. This is the value to add to the string length, as successive 
words are searched, if the string length is initialized to 0. On little- 
endian machines, one wants the index of the first 0-byte from the 
right end of the register, because little-endian machines reverse the 
four bytes when a word is loaded into a register. Specifically, we are 
interested in the following functions, where "00" denotes a 0-byte, "nn" 
denotes a nonzero byte, and "xx" denotes a byte that may be 0 or 
nonzero. 


             0, x = 00xxxxxx,                0, x = xxxxxx00, 
             1, x = nn00xxxx,                1, x = xxxx00nn, 
zbytel(x) =  2, x = nnnn00xx,   zbyter(x) =  2, x = xx00nnnn, 
             3, x = nnnnnn00,                3, x = 00nnnnnn, 
             4, x = nnnnnnnn.                4, x = nnnnnnnn. 


Our first procedure for the find leftmost 0-byte function, shown in 
Figure 6-1, simply tests each byte, in left-to-right order, and returns 
the result when the first 0-byte is found. 


Figure 6-1 Find leftmost 0-byte, simple sequence of tests. 


int zbytel(unsigned x) { 
   if ((x >> 24) == 0) return 0; 
   else if ((x & 0x00FF0000) == 0) return 1; 
   else if ((x & 0x0000FF00) == 0) return 2; 
   else if ((x & 0x000000FF) == 0) return 3; 
   else return 4; 
} 


This executes in two to 11 basic RISC instructions, 11 in the case that 
the word has no 0-bytes (which is the important case for the "strlen" 
function). A very similar program will handle the problem of finding 
the rightmost 0-byte. 


Figure 6-2 shows a branch-free procedure for this function. The idea is 
to convert each 0-byte to 0x80, and each nonzero byte to 0x00, and 
then use number of leading zeros. This procedure executes in eight 
instructions if the machine has the number of leading zeros and nor 
instructions. Some similar tricks are described in [Lamp]. 


Figure 6-2 Find leftmost 0-byte, branch-free code. 


int zbytel(unsigned x) { 
   unsigned y; 
   int n; 
                                        // Original byte: 00 80 other 
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;   // 7F 7F 1xxxxxxx 
   y = ~(y | x | 0x7F7F7F7F);           // 80 00 00000000 
   n = nlz(y) >> 3;                     // n = 0 ... 4, 4 if x 
   return n;                            // has no 0-byte. 
} 


The position of the rightmost 0-byte is given by the number of trailing 
0's in the final value of y computed above, divided by 8 (with 
fraction discarded). Using the expression for computing the number of 
trailing 0's by means of the number of leading zeros instruction (see 
Section 5-4, "Counting Trailing 0's," on page 84), this can be 
computed by replacing the assignment to n in the procedure above 
with: 


   n = (32 - nlz(~y & (y - 1))) >> 3; 


This is a 12-instruction solution, if the machine has nor and and not. 
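Both directions can be exercised with the sketch below, which assumes a 32-bit word. The helper nlz is a simple loop standing in for the instruction, and zero_byte_mask is a hypothetical name for the two assignments to y in Figure 6-2.

```c
#include <assert.h>
#include <stdint.h>

static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

/* 0x80 in each byte position where x has a 0-byte, 0x00 elsewhere. */
static uint32_t zero_byte_mask(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    return ~(y | x | 0x7F7F7F7Fu);
}

/* Index (0 to 3) of the leftmost 0-byte, or 4 if there is none. */
static int zbytel(uint32_t x) {
    return nlz(zero_byte_mask(x)) >> 3;
}

/* Index (0 to 3, counted from the right) of the rightmost 0-byte, or 4. */
static int zbyter(uint32_t x) {
    uint32_t y = zero_byte_mask(x);
    return (32 - nlz(~y & (y - 1u))) >> 3;
}
```

For example, the word 0x11003300 gives zbytel = 1 and zbyter = 0, and a word with no 0-byte gives 4 from both functions.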


In most situations on PowerPC, incidentally, a procedure to find the 
rightmost 0-byte would not be needed. Instead, the words can be 
loaded with the load word byte-reverse instruction (lwbrx). 


The procedure of Figure 6-2 is more valuable on a 64-bit machine 
than on a 32-bit one, because on a 64-bit machine the procedure (with 
obvious modifications) requires about the same number of instructions 
(seven or ten, depending upon how the constant is generated), 
whereas the technique of Figure 6-1 requires 23 instructions worst 
case. 


If only a test for the presence of a 0-byte is wanted, then a branch on 
zero (or nonzero) can be inserted just after the second assignment to 
y. 


If the "nlz" instruction is not available, there does not seem to be any 
really good way to compute the find first 0-byte function. Figure 6-3 
shows a possibility (only the executable part of the code is shown). 


Figure 6-3 Find leftmost 0-byte, not using nlz. 


                                        // Original byte: 00 80 other 
   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F;   // 7F 7F 1xxxxxxx 
   y = ~(y | x | 0x7F7F7F7F);           // 80 00 00000000 
                                        // These steps map: 
   if (y == 0) return 4;                // 00000000 ==> 4, 
   else if (y > 0x0000FFFF)             // 80xxxxxx ==> 0, 
      return (y >> 31) ^ 1;             // 0080xxxx ==> 1, 
   else                                 // 000080xx ==> 2, 
      return (y >> 15) ^ 3;             // 00000080 ==> 3. 


This executes in ten to 13 basic RISC instructions, ten in the all- 
nonzero case. Thus, it is probably not as good as the code of Figure 
6-1, although it does have fewer branch instructions. It does not scale 
very well to 64-bit machines, unfortunately. 


There are other possibilities for avoiding the "nlz" function. The value 
of y computed by the code of Figure 6-3 consists of four bytes, each 
of which is either 0x00 or 0x80. The remainder after dividing such a 
number by 0x7F is the original value with the up-to-four 1-bits moved 
and compressed to the four rightmost positions. Thus, the remainder 
ranges from 0 to 15 and uniquely identifies the original number. For 
example, 


remu(0x80808080, 127) = 15, 
remu(0x80000000, 127) = 8, 
remu(0x00008080, 127) = 3, etc. 


This value can be used to index a table, 16 bytes in size, to get the 
desired result. 


Thus, the code beginning if (y == 0) can be replaced with 


   static char table[16] = {4, 3, 2, 2, 1, 1, 1, 1, 
                            0, 0, 0, 0, 0, 0, 0, 0}; 

   return table[y%127]; 


where y is unsigned. The number 31 can be used in place of 127, but 
with a different table. 


These methods involving dividing by 127 or 31 are really just 
curiosities, because the remainder function is apt to require 20 cycles 
or more even if directly implemented in hardware. However, below 
are two more efficient replacements for the code in Figure 6-3 
beginning with if (y == 0): 


return table[hopu(y, 0x02040810) & 15]; 
return table[y*0x00204081 >> 28]; 


Here, hopu(a, b) denotes the high-order 32 bits of the unsigned 
product of a and b. In the second line, we assume the usual HLL 
convention that the value of the multiplication is the low-order 32 bits 
of the complete product. This might be a practical method, if either 
the machine has a fast multiply or the multiplication by 0x204081 is 
done by shift-and-add's. It can be done in four such instructions, as 
suggested by 


\[ y(1 + 2^{7} + 2^{14} + 2^{21}) = y(1 + 2^{7})(1 + 2^{14}). \]


Using this 4-cycle way to do the multiplication, the total time for the 
procedure comes to 13 cycles (7 to compute y, plus 4 for the shift- 
and-add's, plus 2 for the shift right of 28 and the table index), and of 
course it is branch-free. 
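The remainder and multiplicative variants can be compared side by side with the following sketch, which assumes a 32-bit word. The function names zbytel_rem and zbytel_mul are illustrative, and the table is the 16-entry one described above, whose last eight entries are 0 because any index of 8 or more means the leftmost byte is zero.

```c
#include <assert.h>
#include <stdint.h>

static const char zb_table[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                  0, 0, 0, 0, 0, 0, 0, 0};

/* 0x80 in each byte position where x has a 0-byte, 0x00 elsewhere. */
static uint32_t zero_byte_mask(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    return ~(y | x | 0x7F7F7F7Fu);
}

/* Remainder variant: y mod 127 packs the four flag bits into 0..15. */
static int zbytel_rem(uint32_t x) {
    return zb_table[zero_byte_mask(x) % 127u];
}

/* Multiplicative variant: the low-order 32 bits of y * 0x00204081
   carry the four flag bits in the top four bit positions. */
static int zbytel_mul(uint32_t x) {
    uint32_t y = zero_byte_mask(x);
    return zb_table[(uint32_t)(y * 0x00204081u) >> 28];
}
```

Both functions agree with the definition of zbytel. For instance, 0x11220033 yields 2 and 0x11223344 yields 4.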


These scale reasonably well to a 64-bit machine. For the "modulus" 
method, use 


return table[y%511]; 


where table is of size 256, with values 8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 
0, 2, 0, 1, 0, 4, ... (i.e., table[i] = number of trailing 0's in i). 


For the multiplicative methods, use either 

   return table[hopu(y, 0x0204081020408100) & 255]; or 
   return table[(y*0x0002040810204081) >> 56]; 


where table is of size 256, with values 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 
4, 4, 4, 4, 4, 3, .... 


The multiplication by 0x2 0408 1020 4081 can be done with 

   t1 ← y(1 + 2^7) 
   t2 ← t1(1 + 2^14) 
   t3 ← t2(1 + 2^28) 

which gives a 13-cycle solution. 


All these variations using the table can, of course, implement the find 
rightmost 0-byte function by simply changing the data in the table. 


If the machine does not have the nor instruction, the not in the second 
assignment to y in Figure 6-3 can be omitted, in the case of a 32-bit 
machine, by using one of the three return statements given above, 
with table[i] = 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4. This 
scheme does not quite work on a 64-bit machine. 


Here is an interesting variation on the procedure of Figure 6-2, again 
aimed at machines that do not have number of leading zeros. Let a, b, 
c, and d be 1-bit variables for the predicates "the first byte of x is 
nonzero," "the second byte of x is nonzero," and so on. Then, 


zbytel(x) = a + ab + abc + abcd. 


The multiplications can be done with and's, leading to the procedure 
shown in Figure 6-4 (only the executable code is shown). 


Figure 6-4 Find leftmost 0-byte by evaluating a polynomial. 


   y = (x & 0x7F7F7F7F) + 0x7F7F7F7F; 
   y = y | x;              // Leading 1 on nonzero bytes. 
   t1 = y >> 31;           // t1 = a. 
   t2 = (y >> 23) & t1;    // t2 = ab. 
   t3 = (y >> 15) & t2;    // t3 = abc. 
   t4 = (y >>  7) & t3;    // t4 = abcd. 
   return t1 + t2 + t3 + t4; 


This comes to 15 instructions on the basic RISC, which is not 
particularly fast, but there is a certain amount of parallelism. On a 
superscalar machine that can execute up to three arithmetic 
instructions in parallel, provided they are independent, it comes to 
only ten cycles. 


A simple variation of this does the find rightmost 0-byte function, 
based on 


zbyter(x) = abcd + bcd + cd + d. 


(This requires one more and than the code of Figure 6-4.) 

Some Simple Generalizations 


Functions "zbytel" and "zbyter" can be used to search for a byte equal 
to any particular value, by first exclusive or'ing the argument x with a 
word consisting of the desired value replicated in each byte position. 


For example, to search x for an ASCII blank (0x20), search x ^ 
0x20202020 for a 0-byte. 


Similarly, to search for a byte position in which two words x and y are 
equal, search x ^ y for a 0-byte. 
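These two generalizations amount to a single exclusive or ahead of the 0-byte search, as the sketch below illustrates for a 32-bit word. The names find_byte_l and find_equal_byte_l are hypothetical, and nlz is again a simple stand-in for the instruction.

```c
#include <assert.h>
#include <stdint.h>

static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

/* Figure 6-2: index of the leftmost 0-byte, or 4 if there is none. */
static int zbytel(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    y = ~(y | x | 0x7F7F7F7Fu);
    return nlz(y) >> 3;
}

/* Index of the leftmost byte of x equal to c, or 4 if there is none. */
static int find_byte_l(uint32_t x, uint8_t c) {
    uint32_t pattern = c * 0x01010101u;    /* c replicated in each byte */
    return zbytel(x ^ pattern);
}

/* Index of the leftmost position where x and y hold equal bytes, or 4. */
static int find_equal_byte_l(uint32_t x, uint32_t y) {
    return zbytel(x ^ y);
}
```

Taking the big-endian word 0x41422043 (the characters "AB C"), find_byte_l with c = 0x20 returns 2.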


There is nothing special about byte boundaries in the code of Figure 
6-2 and its variants. For example, to search a word for a 0-value in 
any of the first four bits, the next 12, or the last 16, use the code of 
Figure 6-2 with the mask replaced by 0x77FF7FFF [PHO]. (If a field 
length is 1, use a 0 in the mask at that position.) 


Searching for a Value in a Given Range 


The code of Figure 6-2 can easily be modified to search for a byte in 
the range 0 to any specified value less than 128. To illustrate, the 
following code finds the index of the leftmost byte having value from 
0 to 9: 


   y = (x & 0x7F7F7F7F) + 0x76767676; 
   y = y | x; 
   y = y | 0x7F7F7F7F;     // Bytes > 9 are 0xFF. 
   y = ~y;                 // Bytes > 9 are 0x00, 
                           // bytes <= 9 are 0x80. 


More generally, suppose you want to find the leftmost byte in a word 
that is in the range a to b, where the difference between a and b is less 
than 128. For example, the uppercase letters encoded in ASCII range 
from 0x41 to 0x5A. To find the first uppercase letter in a word, 
subtract 0x41414141 in such a way that the borrow does not 
propagate across byte boundaries, and then use the above code to 
identify bytes having value from 0 to 0x19 (0x5A - 0x41). Using the 
formulas for subtraction given in Section 2-17, "Multibyte Add, 
Subtract, Absolute Value," on page 36, with obvious simplifications 
possible with y = 0x41414141, gives 


   d = (x | 0x80808080) - 0x41414141; 
   d = ~((x | 0x7F7F7F7F) ^ d); 
   y = (d & 0x7F7F7F7F) + 0x66666666; 
   y = y | d; 
   y = y | 0x7F7F7F7F;     // Bytes not from 41-5A are FF. 
   y = ~y;                 // Bytes not from 41-5A are 00, 
                           // bytes from 41-5A are 80. 
   n = nlz(y) >> 3; 


For some ranges of values, simpler code exists. For example, to find 
the first byte whose value is 0x30 to 0x39 (a decimal digit encoded in 
ASCII), simply exclusive or the input word with 0x30303030 and then 
use the code given above to search for a value in the range 0 to 9. 
(This simplification is applicable when the upper and lower limits 
have n high-order bits in common, and the lower limit ends with 8 - n 
0's.) 
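The range searches of this section are collected in the sketch below for a 32-bit word. The function names are illustrative, nlz is a loop standing in for the instruction, and the constants are those given in the text (0x76 = 0x7F - 9 and 0x66 = 0x7F - 0x19).

```c
#include <assert.h>
#include <stdint.h>

static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

/* Index of the leftmost byte with value 0 to 9, or 4 if none. */
static int first_0_to_9(uint32_t x) {
    uint32_t y = (x & 0x7F7F7F7Fu) + 0x76767676u;
    y = y | x;
    y = y | 0x7F7F7F7Fu;           /* bytes > 9 are 0xFF */
    y = ~y;                        /* bytes <= 9 are 0x80, others 0x00 */
    return nlz(y) >> 3;
}

/* Leftmost ASCII decimal digit (0x30 to 0x39), or 4 if none. */
static int first_digit(uint32_t x) {
    return first_0_to_9(x ^ 0x30303030u);
}

/* Leftmost ASCII uppercase letter (0x41 to 0x5A), or 4 if none. */
static int first_upper(uint32_t x) {
    uint32_t d, y;
    d = (x | 0x80808080u) - 0x41414141u;   /* bytewise x - 0x41 with no */
    d = ~((x | 0x7F7F7F7Fu) ^ d);          /* borrow across bytes       */
    y = (d & 0x7F7F7F7Fu) + 0x66666666u;
    y = y | d;
    y = y | 0x7F7F7F7Fu;
    y = ~y;
    return nlz(y) >> 3;
}
```

With the big-endian word 0x6162335A (the characters "ab3Z"), first_digit returns 2 and first_upper returns 3.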


These techniques can be adapted to handle ranges of 128 or larger 
with no additional instructions. For example, to find the index of the 
leftmost byte whose value is in the range 0 to 137 (0x89), simply 
change the line y = y | x to y = y & x in the code above for 
searching for a value from 0 to 9. 


Similarly, changing the line y = y | d to y = y & d in the code 
for finding the leftmost byte whose value is in the range 0x41 to 0x5A 
causes it to find the leftmost byte whose value is in the range 0x41 to 
0xDA. 


6-2 Find First String of 1-Bits of a Given Length 


The problem here is to search a word in a register for the first string of 
1-bits of a given length n or longer, and to return its position, with 
some special indication if no such string exists. Variants are to return 
only the yes/no indication, and to locate the first string of exactly n 1- 
bits. This problem has application in disk-allocation programs, 
particularly for disk compaction (rearranging data on a disk so that all 
blocks used to store a file are contiguous). The problem was suggested 
to me by Albert Chang, who pointed out that it is one of the uses for 
the number of leading zeros instruction. 


We assume here that the number of leading zeros instruction, or a 
suitable subroutine for that function, is available. 


An algorithm that immediately comes to mind is to first count the 
number of leading 0's and skip over them by shifting left by the 
number obtained. Then count the leading 1's by inverting and 
counting leading 0's. If this is of sufficient length, we are done. 
Otherwise, shift left by the number obtained and repeat from the 
beginning. This algorithm might be coded as shown below. If n 
consecutive 1-bits are found, it returns a number from 0 to 31, giving 
the position of the leftmost 1-bit in the leftmost such sequence. 
Otherwise, it returns 32 as a "not found" indication. 


int ffstr1(unsigned x, int n) { 
   int k, p; 

   p = 0;               // Initialize position to return. 
   while (x != 0) { 
      k = nlz(x);       // Skip over initial 0's 
      x = x << k;       // (if any). 
      p = p + k; 
      k = nlz(~x);      // Count first/next group of 1's. 
      if (k >= n)       // If enough, 
         return p;      // return. 
      x = x << k;       // Not enough 1's, skip over 
      p = p + k;        // them. 
   } 
   return 32; 
} 


This algorithm is reasonable if it is expected that the loop will not be 
executed very many times; for example, if it is expected that x will 
have long sequences of 1's and of 0's. This might very well be the 
expectation in the disk-allocation application. Its worst-case execution 
time, however, is not very good; for example, about 178 full RISC 
instructions executed for x = 0x55555555 and n ≥ 2. 


An algorithm that is better in worst-case execution time is based on a 
sequence of shift left and and instructions. To see how this works, 
consider searching for a string of eight or more consecutive 1-bits in a 
32-bit word x. This might be done as follows: 


   x ← x & (x << 1) 
   x ← x & (x << 2) 
   x ← x & (x << 4) 


After the first assignment, the 1's in x indicate the starting positions of 
strings of length 2. After the second assignment, the 1's in x indicate 
the starting positions of strings of length 4 (a string of length 2 
followed by another string of length 2). After the third assignment, the 
1's in x indicate the starting positions of strings of length 8. Executing 
number of leading zeros on this word gives the position of the first 
string of length 8 (or more), or 32 if none exists. 


To develop an algorithm that works for any length n from 1 to 32, we 
will look at this a little differently. First, observe that the above three 
assignments may be done in any order. Reverse order will be more 
convenient. To illustrate the general method, consider the case n = 
10: 


   x1 ← x & (x << 5) 
   x2 ← x1 & (x1 << 2) 
   x3 ← x2 & (x2 << 1) 
   x4 ← x3 & (x3 << 1) 


The first statement shifts by n/2. After executing it, the problem is 
reduced to finding a string of five consecutive 1-bits in x1. This may 
be done by shifting left by ⌊5/2⌋ = 2, and'ing, and searching the 
result for a string of length 3 (5 - 2). The last two statements identify 
where the strings of length 3 are in x2. The sum of the shift amounts is 
always n - 1. The algorithm is shown in Figure 6-5. The execution 
time ranges from 3 to 36 full RISC instructions, as n ranges from 1 to 
32. 


If n is often moderately large, it is not unreasonable to unroll this 
loop by repeating the loop body five times and omitting the test n > 
1. (Five is always sufficient for a 32-bit machine.) This gives a branch- 
free algorithm that runs in a constant time of 20 instructions executed 
(the last assignment to n can be omitted). Although for small values 
of n the three assignments are executed more than necessary, the 
result is unchanged by the extra steps because variable n sticks at the 
value 1, and for this value the three steps have no effect on x or n. 
The unrolled version is faster than the looping version for n ≥ 5, in 
terms of number of instructions executed. 


Figure 6-5 Find first string of n 1's, shift-and-and sequence. 


int ffstr1(unsigned x, int n) { 
   int s; 

   while (n > 1) { 
      s = n >> 1; 
      x = x & (x << s); 
      n = n - s; 
   } 
   return nlz(x); 
} 


A string of exactly n 1-bits can be found in six more instructions (four 
if and not is available). The quantity x computed by the algorithm of 
Figure 6-5 has 1-bits wherever a string of length n or more 1-bits 
begins. Hence, using the final value of x computed by that algorithm, 
the expression 


x & ¬(x >> 1) & ¬(x << 1) 


contains a 1-bit wherever the final x contains an isolated 1-bit, which 
is to say wherever the original x began a string of exactly n 1-bits. 
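The "n or more" and "exactly n" searches can be compared with the sketch below, which assumes a 32-bit word and an unsigned (logical) right shift. The names starts_of_runs and ffstr1_exact are hypothetical; starts_of_runs is simply the loop of Figure 6-5 without the final nlz.

```c
#include <assert.h>
#include <stdint.h>

static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { n++; x <<= 1; }
    return n;
}

/* 1-bits mark the leftmost bit of every string of n or more 1-bits. */
static uint32_t starts_of_runs(uint32_t x, int n) {
    assert(n >= 1 && n <= 32);
    while (n > 1) {
        int s = n >> 1;
        x = x & (x << s);
        n = n - s;
    }
    return x;
}

/* Position of the first string of n or more 1-bits, or 32 if none. */
static int ffstr1(uint32_t x, int n) {
    return nlz(starts_of_runs(x, n));
}

/* Position of the first string of exactly n 1-bits, or 32 if none. */
static int ffstr1_exact(uint32_t x, int n) {
    uint32_t r = starts_of_runs(x, n);
    return nlz(r & ~(r >> 1) & ~(r << 1));   /* keep isolated 1-bits only */
}
```

The word 0x0F0FF000 contains a string of four 1-bits at position 4 and a string of eight at position 12. Accordingly, ffstr1 with n = 3 returns 4, while ffstr1_exact with n = 3 returns 32 because neither string has length exactly 3.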


The algorithm is also easily adapted to finding strings of length n that 
begin at certain locations. For example, to find strings that begin at 
byte boundaries, simply and the final x with 0x80808080. 

It can be used to find strings of 0-bits either by complementing x at 
the start, or by changing the and's to or's and complementing x just 
before invoking "nlz." For example, below is an algorithm for finding 


the first (leftmost) 0-byte (see Section 6-1, "Find First 0-Byte," on page 
91, for a precise definition of this problem). 


   x ← x | (x << 4) 
   x ← x | (x << 2) 
   x ← x | (x << 1) 
   x ← 0x7F7F7F7F | x 
   p ← nlz(¬x) >> 3 


This executes in 12 instructions on the full RISC (not as good as the 
algorithm of Figure 6-2 on page 92, which executes in eight 


instructions). 


Chapter 7. Rearranging Bits and Bytes 


Reversing Bits and Bytes 

Shuffling Bits 

Transposing a Bit Matrix 

Compress, or Generalized Extract 

General Permutations, Sheep and Goats Operation 


Rearrangements and Index Transformations 


7-1 Reversing Bits and Bytes 


By "reversing bits" we mean to reflect the contents of a register about 
the middle so that, for example, 


rev(0x01234567) = 0xE6A2C480. 


By "reversing bytes" we mean a similar reflection of the four bytes of a 
register. Byte reversal is a necessary operation to convert data 
between the "little-endian" format used by DEC and Intel and the "big- 
endian" format used by most other manufacturers. 


Bit reversal can be done quite efficiently by interchanging adjacent 
single bits, then interchanging adjacent 2-bit fields, and so on, as 
shown below [Aus1]. These five assignment statements can be 
executed in any order. 


   x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1; 
   x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2; 
   x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4; 
   x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8; 
   x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16; 


A small improvement results on most machines by using fewer distinct 
large constants and doing the last two assignments in a more 
straightforward way, as is shown in Figure 7-1 (30 basic RISC 
instructions, branch-free). 


Figure 7-1 Reversing bits. 


unsigned rev(unsigned x) { 
   x = (x & 0x55555555) <<  1 | (x >>  1) & 0x55555555; 
   x = (x & 0x33333333) <<  2 | (x >>  2) & 0x33333333; 
   x = (x & 0x0F0F0F0F) <<  4 | (x >>  4) & 0x0F0F0F0F; 
   x = (x << 24) | ((x & 0xFF00) << 8) | 
       ((x >> 8) & 0xFF00) | (x >> 24); 
   return x; 
} 


The last assignment to x in this code does byte reversal in nine basic 
RISC instructions. If the machine has rotate shifts, however, this can 
instead be done in seven instructions with 


x = ((x & 0x00FF00FF) rotr 8) | ((x rotl 8) & 0x00FF00FF), 

where rotr and rotl denote rotate right and rotate left. 


PowerPC can do the byte-reversal operation in only three instructions 
[Hay1]: a rotate left of 8, which positions two of the bytes, followed 
by two "rlwimi" (rotate left word immediate then mask insert) 
instructions. 
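The shift-based and rotate-based byte reversals can be checked against each other with the following sketch, which assumes a 32-bit word. C has no rotate operator, so rotl and rotr are written out as small helper functions; these and the names bswap_shift and bswap_rot are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Rotates for shift amounts 0 < s < 32. */
static uint32_t rotl(uint32_t x, int s) { return (x << s) | (x >> (32 - s)); }
static uint32_t rotr(uint32_t x, int s) { return (x >> s) | (x << (32 - s)); }

/* Byte reversal as in the last assignment of Figure 7-1. */
static uint32_t bswap_shift(uint32_t x) {
    return (x << 24) | ((x & 0xFF00u) << 8) |
           ((x >> 8) & 0xFF00u) | (x >> 24);
}

/* Byte reversal using two rotates and one mask constant. */
static uint32_t bswap_rot(uint32_t x) {
    return rotr(x & 0x00FF00FFu, 8) | (rotl(x, 8) & 0x00FF00FFu);
}
```

Both functions map 0x01234567 to 0x67452301. Many compilers recognize the rotl and rotr idioms above and emit a single rotate instruction where the machine has one, although that is not guaranteed.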


Generalized Bit Reversal 


[GLS1] suggests that the following sort of generalization of bit 
reversal, which he calls "flip," is a good candidate to consider for a 
computer's instruction set: 


   if (k &  1) x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1; 
   if (k &  2) x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2; 
   if (k &  4) x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4; 
   if (k &  8) x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8; 
   if (k & 16) x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16; 


(The last two and operations can be omitted.) For k = 31, this 
operation reverses the bits in a word. For k = 24, it reverses the 
bytes in a word. For k = 7, it reverses the bits in each byte, without 
changing the positions of the bytes. For k = 16, it swaps the left and 
right halfwords of a word, and so on. In general, it moves the bit at 
position m to position m ⊕ k. It can be implemented in hardware, 
very similarly to the way a rotate shifter is usually implemented (five 
stages of MUX's, with each stage controlled by a bit of the shift 
amount k). 
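A software rendering of the operation, assuming a 32-bit word, is sketched below. The function name flip follows the name used in the text; everything else is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Generalized bit reversal: moves the bit at position m to m XOR k. */
static uint32_t flip(uint32_t x, int k) {
    if (k &  1) x = (x & 0x55555555u) <<  1 | (x & 0xAAAAAAAAu) >>  1;
    if (k &  2) x = (x & 0x33333333u) <<  2 | (x & 0xCCCCCCCCu) >>  2;
    if (k &  4) x = (x & 0x0F0F0F0Fu) <<  4 | (x & 0xF0F0F0F0u) >>  4;
    if (k &  8) x = (x & 0x00FF00FFu) <<  8 | (x & 0xFF00FF00u) >>  8;
    if (k & 16) x = (x & 0x0000FFFFu) << 16 | (x & 0xFFFF0000u) >> 16;
    return x;
}
```

For the word 0x01234567, k = 31 gives 0xE6A2C480 (full bit reversal), k = 24 gives 0x67452301 (byte reversal), k = 7 gives 0x80C4A2E6 (bits reversed within each byte), and k = 16 gives 0x45670123 (halfwords swapped).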


Bit-Reversing Novelties 


Item 167 in [HAK] contains rather esoteric expressions for reversing 
6-, 7-, and 8-bit integers. Although these expressions are designed for 
a 36-bit machine, the one for reversing a 6-bit integer works on a 32- 
bit machine, and those for 7- and 8-bit integers work on a 64-bit 
machine. These expressions are as follows: 


6-bit: remu((x * 0x00082082) & 0x01122408, 255) 
7-bit: remu((x * 0x40100401) & 0x442211008, 255) 
8-bit: remu((x * 0x202020202) & 0x10884422010, 1023) 


The result of all these is a "clean" integer—right-adjusted with no 
unused high-order bits set. 


In all these cases the "remu" function can instead be "rem" or "mod," 
because its arguments are positive. The remainder function is simply 
summing the digits of a base 256 or base 1024 number, much like 
casting out nines. Hence it can be replaced with a multiply and a shift 
right. For example, the 6-bit formula has the following alternative on 
a 32-bit machine (the multiplication must be modulo 2^32): 


   t ← (x * 0x00082082) & 0x01122408 
   (t * 0x01010101) >> 24 


These formulas are limited in their utility because they involve a 
remaindering operation (20 cycles or more) and/or some 
multiplications, as well as loading of large constants. The formula 
immediately above requires ten basic RISC instructions, two of which 
are multiply's, which amounts to about 20 cycles on a present-day 
RISC. On the other hand, an adaptation of the code of Figure 7-1 to 
reverse 6-bit integers requires about 15 instructions, and probably 
about 9 to 15 cycles, depending on the amount of instruction-level 
parallelism in the machine. These techniques, however, do give 
compact code. Below are a few more techniques that might possibly 
be useful, all for a 32-bit machine. They involve a sort of double 
application of the idea from [HAK], to extend the technique to 8- and 


9-bit integers on a 32-bit machine. 


The following is a formula for reversing an 8-bit integer: 


s ← (x * 0x02020202) & 0x84422010
t ← (x * 8) & 0x00000420
remu(s + t, 1023)


Here the "remu" cannot be changed to a multiply and shift. (You have 
to work these out, and look at the bit patterns, to see why.) 


Here is a similar formula for reversing an 8-bit integer, which is 
interesting because it can be simplified quite a bit: 


s ← (x * 0x00020202) & 0x01044010
t ← (x * 0x00080808) & 0x02088020
remu(s + t, 4095)


The simplifications are that the second product is just a shift left of the 
first product, the last mask can be generated from the second with just 
one instruction (shift), and the remainder can be replaced by a 
multiply and shift. It simplifies to 14 basic RISC instructions, two of 
which are multiply's: 


u ← x * 0x00020202
m ← 0x01044010
s ← u & m
t ← (u << 2) & (m << 1)
(0x01001001 * (s + t)) >> 24
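A sketch of this second 8-bit formula and its simplified form in C follows. The names `rev8_remu` and `rev8_mulshift` are assumptions for illustration, and `uint32_t` stands in for a 32-bit register.

```c
#include <stdint.h>

/* 8-bit reversal via two masked products and a remainder by 4095. */
uint32_t rev8_remu(uint32_t x) {
    uint32_t s = (x * 0x00020202u) & 0x01044010u;
    uint32_t t = (x * 0x00080808u) & 0x02088020u;
    return (s + t) % 4095u;
}

/* Simplified form: one multiply for the copies of x, one for the digit sum. */
uint32_t rev8_mulshift(uint32_t x) {
    uint32_t u = x * 0x00020202u;      /* x<<1 | x<<9 | x<<17           */
    uint32_t m = 0x01044010u;
    uint32_t s = u & m;
    uint32_t t = (u << 2) & (m << 1);  /* same as (x*0x00080808) & mask */
    return (0x01001001u * (s + t)) >> 24;  /* add base-4096 digits      */
}
```

Both functions assume 0 <= x < 256; larger inputs are not masked off.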


The following is a formula for reversing a 9-bit integer: 


s ← (x * 0x01001001) & 0x84108010
t ← (x * 0x00040040) & 0x00841080
remu(s + t, 1023)


The second multiplication can be avoided because the product is equal 
to the first product shifted right six positions. The last mask is equal to 
the second mask shifted right eight positions. With these 
simplifications, this requires 12 basic RISC instructions, including the 
one multiply and one remainder. The remainder operation must be 
unsigned, and it cannot be changed to a multiply and shift. 
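The 9-bit formula with both simplifications applied might look as follows in C. This is an illustrative sketch: the name `rev9` is an assumption, and the product is allowed to wrap modulo 2^32 (bit 8 of the top copy of x is lost, which is harmless because no mask bit selects it).

```c
#include <stdint.h>

/* Reverse a 9-bit integer (0 <= x < 512) on a 32-bit machine. */
uint32_t rev9(uint32_t x) {
    uint32_t u = x * 0x01001001u;     /* copies of x at bits 0, 12, 24    */
    uint32_t m = 0x84108010u;
    uint32_t s = u & m;
    uint32_t t = (u >> 6) & (m >> 8); /* replaces the second multiply     */
    return (s + t) % 1023u;           /* unsigned: bit 31 of s may be set */
}
```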


The reader who studies these marvels will be able to devise similar 
code for other bit-permuting operations. As a simple (and artificial) 
example, suppose it is desired to extract every other bit from an 8-bit 
quantity, and compress the four bits to the right. That is, the desired 
transformation is 


0000 0000 0000 0000 0000 0000 abcd efgh ==> 
0000 0000 0000 0000 0000 0000 0000 bdfh 


This may be computed as follows: 


t ← (x * 0x01010101) & 0x40100401
(t * 0x08040201) >> 27
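In C the extraction could be sketched as below; the function name `extract_alternate_bits8` is hypothetical. The first multiply spreads four copies of x across the word so that each wanted bit can be isolated, and the second gathers the four isolated bits into bits 27 through 30.

```c
#include <stdint.h>

/* From abcdefgh (an 8-bit value) produce the 4-bit value bdfh. */
uint32_t extract_alternate_bits8(uint32_t x) {
    uint32_t t = (x * 0x01010101u) & 0x40100401u;
    return (t * 0x08040201u) >> 27;
}
```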


On most machines, the most practical way to do all these operations is 
by indexing into a table of 1-byte (or 9-bit) integers. 


Incrementing a Reversed Integer


The Fast Fourier Transform (FFT) algorithm employs an integer i and 
its bit reversal rev(i) in a loop in which i is incremented by 1 [PB]. 
Straightforward coding would increment i and then compute rev(i) on 
each loop iteration. For small integers, computing rev(i) by table 


lookup is fast and practical. For large integers, however, table lookup 
is not practical and, as we have seen, computing rev(i) requires some 
29 instructions. 


If table lookup cannot be used, it is more efficient to maintain i in 
both normal and bit-reversed forms, incrementing them both on each 
loop iteration. This raises the question of how best to increment an 
integer that is in a register in reversed form. To illustrate, on a 4-bit 
machine we wish to successively step through the values (in 
hexadecimal)


0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F.


In the FFT algorithm, i and its reversal are both some specific number 
of bits in length, almost certainly less than 32, and they are both 
right-justified in the register. However, we assume here that i is a 32- 
bit integer. After adding 1 to the reversed 32-bit integer, a shift right 
of the appropriate number of bits will make the result usable by the 
FFT algorithm (both i and rev(i) are used to index an array in 
memory). 


The straightforward way to increment a reversed integer is to scan 
from the left for the first 0-bit, set it to 1, and set all bits to the left of 
it (if any) to 0's. One way to code this is


unsigned x, m;

m = 0x80000000;
x = x ^ m;
if ((int)x >= 0) {
   do {
      m = m >> 1;
      x = x ^ m;
   } while (x < m);
}


This executes in three basic RISC instructions if x begins with a 0-bit, 
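and four additional instructions for each loop iteration. Because x
begins with a 0-bit half the time, with 10 (binary) one-fourth of the
time, and so on, the average number of instructions executed is
approximately


\[
\begin{aligned}
& 3\cdot\tfrac{1}{2} + 7\cdot\tfrac{1}{4} + 11\cdot\tfrac{1}{8} + 15\cdot\tfrac{1}{16} + \cdots \\
&= 4\cdot\tfrac{1}{2} + 8\cdot\tfrac{1}{4} + 12\cdot\tfrac{1}{8} + 16\cdot\tfrac{1}{16} + \cdots - 1 \\
&= 4\left(\tfrac{1}{2} + \tfrac{2}{4} + \tfrac{3}{8} + \tfrac{4}{16} + \cdots\right) - 1 \\
&= 7.
\end{aligned}
\]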


(In the second line we added and subtracted 1, with the first 1 in the 
form 1/2 + 1/4 + 1/8 + 1/16 + .... This makes the series similar to 
the one analyzed on page 86.) The number of instructions executed in 
the worst case, however, is quite large (131). 


If number of leading zeros is available, adding 1 to a reversed integer 
may be done as follows: 


First execute:     s ← nlz(¬x)
and then either:   x ← x ⊕ (0x80000000 >> s)          (signed shift right)
or:                x ← ((x << s) + 0x80000000) >> s   (unsigned shift right)


Either method requires five full RISC instructions and, to properly 
wrap around from OxFFFFFFFF to 0, requires that the shifts be modulo 
64. (These formulas fail in this respect on the Intel x86 machines, 
because the shifts are modulo 32.) 
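In C, a shift of a 32-bit value by 32 is undefined, so a portable sketch of the second formula can carry out the shifts in a 64-bit type. The names `nlz32` and `inc_reversed` are illustrative assumptions, and `nlz32` is a simple loop standing in for a hardware count-leading-zeros instruction.

```c
#include <stdint.h>

/* Number of leading zeros; returns 32 for x == 0. */
int nlz32(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Add 1 to a 32-bit integer held in bit-reversed form. */
uint32_t inc_reversed(uint32_t x) {
    int s = nlz32(~x);                  /* count of leading 1-bits, 0..32 */
    uint64_t w = ((uint64_t)x << s) & 0xFFFFFFFFu;  /* x << s mod 2^32    */
    w += 0x80000000u;                   /* set the first 0-bit            */
    return (uint32_t)(w >> s);          /* shift by 32 is valid on 64 bits */
}
```

With s = 32 (x = 0xFFFFFFFF) the left shift clears the word and the right shift discards the added bit, giving the wrap-around to 0.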


7-2 Shuffling Bits 


Another important permutation of the bits of a word is the "perfect 
shuffle" operation, which has applications in cryptography. There are 
two varieties, called the "outer" and "inner" perfect shuffles. They both 
interleave the bits in the two halves of a word in a manner similar to a 
perfect shuffle of a deck of 32 cards, but they differ in which card is 
allowed to fall first. In the outer perfect shuffle, the outer (end) bits 
remain in the outer positions, and in the inner perfect shuffle, bit 15 
moves to the left end of the word (position 31). If the 32-bit word is 
(where each letter denotes a single bit) 


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP, 
then after the outer perfect shuffle it is 

aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP,
and after the inner perfect shuffle it is

AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp.


Assume the word size W is a power of 2. Then the outer perfect shuffle 
operation can be accomplished with basic RISC instructions in 
log2(W/2) steps, where each step swaps the second and third quartiles
of successively smaller pieces [GLS1]. That is, a 32-bit word is 
transformed as follows: 


abcd efgh ijkl mnop ABCD EFGH IJKL MNOP
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP


Straightforward code for this is 


x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;


which requires 42 basic RISC instructions. This can be reduced to 30 
instructions, although at an increase from 17 to 21 cycles on a 
machine with unlimited instruction-level parallelism, by using the 
exclusive or method of exchanging two fields of a register (described 
on page 40). All quantities are unsigned: 


t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);


The inverse operation, the outer unshuffle, is easily accomplished by 
performing the swaps in reverse order: 


t = (x ^ (x >> 1)) & 0x22222222;  x = x ^ t ^ (t << 1);
t = (x ^ (x >> 2)) & 0x0C0C0C0C;  x = x ^ t ^ (t << 2);
t = (x ^ (x >> 4)) & 0x00F000F0;  x = x ^ t ^ (t << 4);
t = (x ^ (x >> 8)) & 0x0000FF00;  x = x ^ t ^ (t << 8);


Using only the last two steps of either of the above two shuffle 
sequences shuffles the bits of each byte separately. Using only the last 
three steps shuffles the bits of each halfword separately, and so on. 
Similar remarks apply to unshuffling, except by using the first two or 
three steps. 


To get the inner perfect shuffle, prepend to these sequences a step to 
swap the left and right halves of the register: 


x = (x >> 16) | (x << 16); 


(or use a rotate of 16 bit positions). The unshuffle sequence can be 
similarly modified by appending this line of code. 
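For experimentation, the four operations can be wrapped as C functions. This is a sketch built from the exclusive-or sequences in the text; the function names and the use of `uint32_t` are assumptions.

```c
#include <stdint.h>

uint32_t outer_shuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 8)) & 0x0000FF00u;  x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4)) & 0x00F000F0u;  x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu;  x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1)) & 0x22222222u;  x = x ^ t ^ (t << 1);
    return x;
}

uint32_t outer_unshuffle(uint32_t x) {
    uint32_t t;
    t = (x ^ (x >> 1)) & 0x22222222u;  x = x ^ t ^ (t << 1);
    t = (x ^ (x >> 2)) & 0x0C0C0C0Cu;  x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 4)) & 0x00F000F0u;  x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 8)) & 0x0000FF00u;  x = x ^ t ^ (t << 8);
    return x;
}

/* Inner shuffle: swap the halves first, then do the outer shuffle. */
uint32_t inner_shuffle(uint32_t x) {
    return outer_shuffle((x >> 16) | (x << 16));
}

/* Inner unshuffle: outer unshuffle, then swap the halves. */
uint32_t inner_unshuffle(uint32_t x) {
    x = outer_unshuffle(x);
    return (x >> 16) | (x << 16);
}
```

For example, the outer shuffle sends the right half 0x0000FFFF to the even-numbered bit positions (0x55555555), while the inner shuffle sends it to the odd-numbered ones (0xAAAAAAAA).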


Altering the transformation to swap the first and fourth quartiles of 
successively smaller pieces produces the bit reversal of the inner 
perfect shuffle. 


Perhaps worth mentioning is the special case in which the left half of 
the word x is all 0. In other words, we want to move the bits in the
right half of x to every other bit position; that is, to transform the
32-bit word


0000 0000 0000 0000 ABCD EFGH IJKL MNOP
to
0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P.


The outer perfect shuffle code can be simplified to do this task in 22 
basic RISC instructions. The code below, however, does it in only 19 
basic RISC instructions, at no cost in execution time on a machine 
with unlimited instruction-level parallelism (12 cycles with either 
method). This code does not require that the left half of word x be 
initially cleared. 


x = ((x & 0xFF00) << 8) | (x & 0x00FF);
x = ((x << 4) | x) & 0x0F0F0F0F;
x = ((x << 2) | x) & 0x33333333;
x = ((x << 1) | x) & 0x55555555;


Similarly, for the inverse of this "half shuffle" operation (a special case 
of compress; see page 116), the outer perfect unshuffle code can be 
simplified to do the task in 26 or 29 basic RISC instructions, 
depending on whether or not an initial and operation is required to 
clear the bits in the odd positions. The code below, however, does it in 
only 18 or 21 basic RISC instructions, and with less execution time on 
a machine with unlimited instruction-level parallelism (12 or 15 
cycles). 


x = x & 0x55555555;              // (If required.)
x = ((x >> 1) | x) & 0x33333333;
x = ((x >> 2) | x) & 0x0F0F0F0F;
x = ((x >> 4) | x) & 0x00FF00FF;
x = ((x >> 8) | x) & 0x0000FFFF;


7-3 Transposing a Bit Matrix 


The transpose of a matrix A is a matrix whose columns are the rows of 
A and whose rows are the columns of A. Here we consider the 
problem of computing the transpose of a bit matrix whose elements 
are single bits that are packed eight per byte, with rows and columns 
beginning on byte boundaries. This seemingly simple transformation is 
surprisingly costly in instructions executed. 


On most machines it would be very slow to load and store individual 
bits, mainly due to the code that would be required to extract and 
(worse yet) to store individual bits. A better method is to partition the 
matrix into 8x8 submatrices, load each 8x8 submatrix into registers, 
compute the transpose of the submatrix in registers, and then store the 
8x8 result in the appropriate place in the target matrix. This section 
first discusses the problem of computing the transpose of the 8x8 
submatrix. 


It doesn't matter whether the matrix is stored in row-major or column- 
major order; computing the transpose consists of the same operations 
in either event. Assuming for discussion that it's in row-major order, 
an 8x8 submatrix is loaded into eight registers with eight load byte 
instructions, addressing a column of the source matrix. That is, the 
addresses referenced by the load byte instructions are separated by 
multiples of the source matrix width in bytes. After the transpose of 
the 8x8 submatrix is computed, it is stored in a column of the target 
matrix—that is, it is stored with eight store byte instructions into 
locations separated by multiples of the width of the target matrix in 
bytes (which is different from the width of the source matrix if the 
matrices are not square). Thus, we are given eight 8-bit quantities 
right-justified in registers a0, a1, ..., a7, and we wish to compute
eight 8-bit quantities right-justified in registers b0, b1, ..., b7, for 
use in the store byte instructions. This is illustrated below, where each 
digit and letter represents a single bit. Notice that we consider the 
main diagonal to run from bit 7 of byte 0 to bit 0 of byte 7. Some 
readers with a little-endian background may be accustomed to 
thinking of the main diagonal as running from bit 0 of byte 0 to bit 7 
of byte 7. 


a0 = 0123 4567        b0 = 08go wEMU
a1 = 89ab cdef        b1 = 19hp xFNV
a2 = ghij klmn        b2 = 2aiq yGOW
a3 = opqr stuv  ==>   b3 = 3bjr zHPX
a4 = wxyz ABCD        b4 = 4cks AIQY
a5 = EFGH IJKL        b5 = 5dlt BJRZ
a6 = MNOP QRST        b6 = 6emu CKS$
a7 = UVWX YZ$.        b7 = 7fnv DLT.


The straightforward code for this problem is to select and place each
result bit individually, as follows. The multiplications and divisions
represent left and right shifts, respectively.


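b0 = (a0 & 128)    | (a1 & 128)/2  | (a2 & 128)/4  | (a3 & 128)/8  |
     (a4 & 128)/16 | (a5 & 128)/32 | (a6 & 128)/64 | (a7      )/128;
b1 = (a0 &  64)*2  | (a1 &  64)    | (a2 &  64)/2  | (a3 &  64)/4  |
     (a4 &  64)/8  | (a5 &  64)/16 | (a6 &  64)/32 | (a7 &  64)/64;
b2 = (a0 &  32)*4  | (a1 &  32)*2  | (a2 &  32)    | (a3 &  32)/2  |
     (a4 &  32)/4  | (a5 &  32)/8  | (a6 &  32)/16 | (a7 &  32)/32;
b3 = (a0 &  16)*8  | (a1 &  16)*4  | (a2 &  16)*2  | (a3 &  16)    |
     (a4 &  16)/2  | (a5 &  16)/4  | (a6 &  16)/8  | (a7 &  16)/16;
b4 = (a0 &   8)*16 | (a1 &   8)*8  | (a2 &   8)*4  | (a3 &   8)*2  |
     (a4 &   8)    | (a5 &   8)/2  | (a6 &   8)/4  | (a7 &   8)/8;
b5 = (a0 &   4)*32 | (a1 &   4)*16 | (a2 &   4)*8  | (a3 &   4)*4  |
     (a4 &   4)*2  | (a5 &   4)    | (a6 &   4)/2  | (a7 &   4)/4;
b6 = (a0 &   2)*64 | (a1 &   2)*32 | (a2 &   2)*16 | (a3 &   2)*8  |
     (a4 &   2)*4  | (a5 &   2)*2  | (a6 &   2)    | (a7 &   2)/2;
b7 = (a0      )*128| (a1 &   1)*64 | (a2 &   1)*32 | (a3 &   1)*16 |
     (a4 &   1)*8  | (a5 &   1)*4  | (a6 &   1)*2  | (a7 &   1);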


This executes in 174 instructions on most machines (62 and's, 56
shift's, and 56 or's). The or's can of course be add's. On PowerPC it can
be done, perhaps surprisingly, in 63 instructions (seven move
register's and 56 rotate left word immediate then mask insert's). We
are not counting the load byte and store byte instructions, nor their
addressing code.


Although there does not seem to be a really great algorithm for this 
problem, the method to be described beats the straightforward 
method by more than a factor of 2 on a basic RISC machine. 


First, treat the 8x8-bit matrix as 16 2x2-bit matrices, and transpose 
each of the 16 2x2-bit matrices. Second, treat the matrix as four 2x2 
submatrices whose elements are 2x2-bit matrices and transpose each 
of the four 2x2 submatrices. Finally, treat the matrix as a 2x2 matrix 
whose elements are 4x4-bit matrices, and transpose the 2x2 matrix. 
These transformations are illustrated below. 


0123 4567     082a 4c6e     08go 4cks     08go wEMU
89ab cdef     193b 5d7f     19hp 5dlt     19hp xFNV
ghij klmn     goiq ksmu     2aiq 6emu     2aiq yGOW
opqr stuv     hpjr ltnv     3bjr 7fnv     3bjr zHPX
          ==>           ==>           ==>
wxyz ABCD     wEyG AICK     wEMU AIQY     4cks AIQY
EFGH IJKL     xFzH BJDL     xFNV BJRZ     5dlt BJRZ
MNOP QRST     MUOW QYS$     yGOW CKS$     6emu CKS$
UVWX YZ$.     NVPX RZT.     zHPX DLT.     7fnv DLT.


Rather than carrying out these steps on the eight individual bytes in 
eight registers, a net improvement results from first packing the bytes 
four to a register, performing the bit-swaps on the two registers, and 
then unpacking. A complete procedure is shown in Figure 7-2. 
Parameter A is the address of the first byte of an 8x8 submatrix of the 
source matrix, which is of size 8mx8n bits. Similarly, parameter B is 
the address of the first byte of an 8x8 submatrix in the target matrix, 
which is of size 8nx8m bits. That is, the full source matrix is 8mxn 
bytes, and the full target matrix is 8nxm bytes. 


Figure 7-2 Transposing an 8x8-bit matrix. 


void transpose8(unsigned char A[8], int m, int n,
                unsigned char B[8]) {
   unsigned x, y, t;

   // Load the array and pack it into x and y.

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m];
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m];

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7);

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14);
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14);

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);
   x = t;

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x;
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y;
}
The line 


t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);


is quite cryptic, for sure. It is swapping bits 1 and 8 (counting from 
the right), 3 and 10, 5 and 12, and so on, in word x, while not 
moving bits 0, 2, 4, and so on. The swaps are done with the exclusive 
or method of bit swapping, described on page 40. Word x, before and 
after the first round of swaps, is 


0123 4567 89ab cdef ghij klmn opqr stuv
082a 4c6e 193b 5d7f goiq ksmu hpjr ltnv


To get a realistic comparison of these methods, the naive method 
described on page 109 was filled out into a complete program similar 
to that of Figure 7-2. Both were compiled with the GNU C compiler to 
a target machine that is very similar to the basic RISC. The resulting 
number of instructions, counting all load's, store's, addressing code,
prologs, and epilogs, is 219 for the naive code and 101 for Figure 7-2. 
(The prologs and epilogs were null, except for a return branch 
instruction.) A version of the code of Figure 7-2 adapted to a 64-bit 
basic RISC (in which x and y would be held in the same register) 
would be about 85 instructions. 


The algorithm of Figure 7-2 runs from fine to coarse granularity, 
based on the lengths of the groups of bits that are swapped. The 
method can also be run from coarse to fine granularity. To do this, 
first treat the 8x8-bit matrix as a 2x2 matrix whose elements are 4x4- 
bit matrices, and transpose the 2x2 matrix. Then treat each of the four
4x4 submatrices as a 2x2 matrix whose elements are 2x2-bit matrices, 
and transpose each of the four 2x2 submatrices, and so on. The code 
for this is the same as that of Figure 7-2 except with the three groups 
of statements that do the bit-rearranging run in reverse order. 
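A sketch of the coarse-to-fine variant follows. It is the code of Figure 7-2 with the three groups of statements reordered; the name `transpose8_coarse`, the `const` qualifier, the explicit casts, and the use of `uint32_t` are assumptions added for illustration. The reordering is harmless because each group exchanges a different pair of row-index and column-index bits, so the three groups commute.

```c
#include <stdint.h>

void transpose8_coarse(const unsigned char A[8], int m, int n,
                       unsigned char B[8]) {
    uint32_t x, y, t;

    /* Pack rows 0-3 into x and rows 4-7 into y. */
    x = ((uint32_t)A[0]   << 24) | ((uint32_t)A[m]   << 16) |
        ((uint32_t)A[2*m] <<  8) |  (uint32_t)A[3*m];
    y = ((uint32_t)A[4*m] << 24) | ((uint32_t)A[5*m] << 16) |
        ((uint32_t)A[6*m] <<  8) |  (uint32_t)A[7*m];

    /* Coarse: transpose the 2x2 matrix of 4x4-bit blocks. */
    t = (x & 0xF0F0F0F0u) | ((y >> 4) & 0x0F0F0F0Fu);
    y = ((x << 4) & 0xF0F0F0F0u) | (y & 0x0F0F0F0Fu);
    x = t;

    /* Medium: within each 4x4 block, transpose its 2x2-bit blocks. */
    t = (x ^ (x >> 14)) & 0x0000CCCCu;  x = x ^ t ^ (t << 14);
    t = (y ^ (y >> 14)) & 0x0000CCCCu;  y = y ^ t ^ (t << 14);

    /* Fine: transpose each 2x2-bit matrix. */
    t = (x ^ (x >> 7)) & 0x00AA00AAu;   x = x ^ t ^ (t << 7);
    t = (y ^ (y >> 7)) & 0x00AA00AAu;   y = y ^ t ^ (t << 7);

    B[0]   = (unsigned char)(x >> 24);  B[n]   = (unsigned char)(x >> 16);
    B[2*n] = (unsigned char)(x >> 8);   B[3*n] = (unsigned char)x;
    B[4*n] = (unsigned char)(y >> 24);  B[5*n] = (unsigned char)(y >> 16);
    B[6*n] = (unsigned char)(y >> 8);   B[7*n] = (unsigned char)y;
}
```

Calling it with m = n = 1 transposes a standalone 8x8 matrix stored in eight consecutive bytes.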


Transposing a 32x32-Bit Matrix 


The same recursive technique that was used for the 8x8-bit matrix can 
of course be used for larger matrices. For a 32x32-bit matrix it takes 
five stages. 


The details are quite different from Figure 7-2 because here we 
assume that the entire 32x32-bit matrix does not fit in the general 
register space, and we seek a compact procedure that indexes the 
appropriate words of the bit matrix to do the bit swaps. The algorithm 
to be described works best if run from coarse to fine granularity. 


In the first stage, treat the matrix as four 16x16-bit matrices, and 
transform it as follows: 


| A B |  ==>  | A C |
| C D |       | B D |


A denotes the left half of the first 16 words of the matrix, B denotes 
the right half of the first 16 words, and so on. It should be clear that 
the above transformation may be accomplished by the following 
swaps: 


Right half of word 0 with the left half of word 16,
Right half of word 1 with the left half of word 17,
...
Right half of word 15 with the left half of word 31.


To implement this in code, we will have an index k that ranges from 0 
to 15. In a loop controlled by k, the right half of word k will be 
swapped with the left half of word k + 16. 


In the second stage, treat the matrix as 16 8x8-bit matrices, and 
transform it as follows: 


A B C D        A E C G
E F G H  ==>   B F D H
I J K L        I M K O
M N O P        J N L P


This transformation may be accomplished by the following swaps: 
Bits 0x00FF00FF of word 0 with bits 0xFF00FF00 of word 8,
Bits 0x00FF00FF of word 1 with bits 0xFF00FF00 of word 9, and so
on.


This means that bits 0-7 (the least significant eight bits) of word 0 are 
swapped with bits 8-15 of word 8, and so on. The indexes of the first
word in these swaps are k = 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20,
21, 22, 23. A way to step k through these values is


k' = (k + 9) & ~8.


In the loop controlled by k, bits of word k are swapped with bits of 
word k +8. 


Similarly, the third stage does the following swaps: 
Bits 0x0F0F0F0F of word 0 with bits 0xF0F0F0F0 of word 4,
Bits 0x0F0F0F0F of word 1 with bits 0xF0F0F0F0 of word 5, and so
on.


The indexes of the first word in these swaps are k = 0, 1, 2, 3, 8, 9, 
10, 11, 16, 17, 18, 19, 24, 25, 26, 27. A way to step k through these 
values is 


k' = (k + 5) & ~4.


In the loop controlled by k, bits of word k are swapped with bits of 
word k + 4. 
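The stepping rule generalizes to k' = (k + j + 1) & ~j for a stage that pairs word k with word k + j. The sketch below collects the first-word indexes for a given stage; the helper name `stage_indexes` is hypothetical.

```c
/* Fill out[] with the indexes k visited in the stage that swaps bits of
   word k with bits of word k + j (j = 16, 8, 4, 2, or 1).
   Returns the number of indexes written, which is 16 for each stage. */
int stage_indexes(int j, int out[16]) {
    int n = 0;
    for (int k = 0; k < 32; k = (k + j + 1) & ~j)
        out[n++] = k;
    return n;
}
```

Adding j + 1 carries through bit j whenever the low bits of k are about to overflow into it, and the `& ~j` then clears that bit, so k skips every index that has bit j set.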


These considerations are coded rather compactly in the C function 
shown in Figure 7-3 [GLS1]. The outer loop controls the five stages, 
with j taking on the values 16, 8, 4, 2, and 1. It also steps the mask
m through the values 0x0000FFFF, 0x00FF00FF, 0x0F0F0F0F,
0x33333333, and 0x55555555. (The code for this, m = m ^ (m <<
j), is a nice little trick. It does not have an inverse, which is the main
reason this code works best for coarse to fine transformations.) The
inner loop steps k through the values described above. The inner loop
body swaps the bits of a[k] identified by mask m with the bits of
a[k + j] shifted right j and identified by m, which is equivalent to the
bits of a[k + j] identified with the complement of m. The code for
performing these swaps is an adaptation of the "three exclusive or"
technique shown on page 39 column (c).


Figure 7-3 Compact code for transposing a 32x32-bit matrix. 


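void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
         t = (A[k] ^ (A[k+j] >> j)) & m;
         A[k] = A[k] ^ t;
         A[k+j] = A[k+j] ^ (t << j);
      }
   }
}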


Based on compiling this function with the GNU C compiler to a 
machine very similar to the basic RISC, this compiles into 31 
instructions, with 20 in the inner loop and 7 in the outer loop but not 
in the inner loop. Thus, it executes in 4 + 5(7 + 16·20) = 1639
instructions. In contrast, if this function were performed using 16 calls 
on the 8x8 transpose program of Figure 7-2, then it would take 
16(101 + 5) = 1696 instructions, assuming the 16 calls are "strung 
out." This includes five instructions for each function call (observed in 
compiled code). Thus, the two methods are, on the surface anyway, 
very nearly equal in execution time. 


On the other hand, for a 64-bit machine the code of Figure 7-3 can 
easily be modified to transpose a 64x64-bit matrix, and it would take 
about 4 + 6(7 + 32·20) = 3886 instructions. Doing the job with 64
executions of the 8x8 transpose method would take about 64(85 + 5) 
= 5760 instructions. 


The algorithm works in place, and thus if it is used to transpose a 
larger matrix, additional steps are required to move 32x32-bit 
submatrices. It can be made to put the result matrix in an area distinct 
from the source matrix by separating out either the first or last 
execution of the "for j-loop" and having it store the result in the other 
area. 


About half the instructions executed by the function of Figure 7-3 are 
for loop control, and the function loads and stores the entire matrix 
five times. Would it be reasonable to reduce this overhead by 
unrolling the loops? It would, if you are looking for the ultimate in 
speed, if memory space is not a problem, if your machine's I-fetching 
can keep up with a large block of straight-line code, and, especially if 
the branches or loads are costly in execution time. The bulk of the 
program will be the six instructions that do the bit swaps repeated 80 
times (5 · 16). In addition, the program will need 32 load instructions
to load the source matrix and 32 store instructions to store the result, 
for a total of at least 544 instructions. 


Our GNU C compiler will not unroll loops by such large factors (16 for 
the inner loop, five for the outer loop). Figure 7-4 outlines a program 
in which the unrolling is done by hand. This program is shown as not 
working in place, but it executes correctly in place, if that is desired, 
by invoking it with identical arguments. The number of "swap" lines is 
80. Our GNU C compiler for the basic RISC machine compiles this into 
576 instructions (branch-free, except for the function return), counting 
prologs and epilogs. This machine does not have the store multiple 
and load multiple instructions, but it can save and restore registers 
two at a time with store double and load double instructions. 


Figure 7-4 Straight-line code for transposing a 32x32-bit matrix. 


#define swap(a0, a1, j, m) t = (a0 ^ (a1 >> j)) & m; \
                           a0 = a0 ^ t; \
                           a1 = a1 ^ (t << j);

void transpose32(unsigned A[32], unsigned B[32]) {
   unsigned m, t;
   unsigned a0, a1, a2, a3, a4, a5, a6, a7,
            a8, a9, a10, a11, a12, a13, a14, a15,
            a16, a17, a18, a19, a20, a21, a22, a23,
            a24, a25, a26, a27, a28, a29, a30, a31;

   a0 = A[ 0];  a1 = A[ 1];  a2 = A[ 2];  a3 = A[ 3];
   a4 = A[ 4];  a5 = A[ 5];  a6 = A[ 6];  a7 = A[ 7];
   ...
   a28 = A[28]; a29 = A[29]; a30 = A[30]; a31 = A[31];

   m = 0x0000FFFF;
   swap(a0, a16, 16, m)
   swap(a1, a17, 16, m)
   ...
   swap(a15, a31, 16, m)
   m = 0x00FF00FF;
   swap(a0, a8, 8, m)
   swap(a1, a9, 8, m)
   ...
   ...
   swap(a28, a29, 1, m)
   swap(a30, a31, 1, m)

   B[ 0] = a0;  B[ 1] = a1;  B[ 2] = a2;  B[ 3] = a3;
   B[ 4] = a4;  B[ 5] = a5;  B[ 6] = a6;  B[ 7] = a7;
   ...
   B[28] = a28; B[29] = a29; B[30] = a30; B[31] = a31;
}


There is a way to squeeze a little more performance out of this if your 
machine has a rotate shift instruction (either left or right). The idea is
to replace all the swap operations of Figure 7-4, which take six
instructions each, with simpler swaps that do not involve a shift,
which take four instructions each (use the swap macro given, with the
shifts omitted).


First, rotate right words A[16..31] (that is, A[k] for 16 ≤ k ≤ 31)
16 bit positions. Second, swap the right halves of A[0] with A[16],
A[1] with A[17], and so on, similarly to the code of Figure 7-4. Third,
rotate right words A[8..15] and A[24..31] by eight bit positions, and
then swap the bits indicated by a mask of 0x00FF00FF in words A[0]
and A[8], A[1] and A[9], and so on, as in the code of Figure 7-4. After
five stages of this, you don't quite have the transpose. Finally, you
have to rotate left word A[1] by one bit position, A[2] by two bit
positions, and so on (31 instructions). We do not show the code, but
the steps are illustrated below for a 4x4-bit matrix.


abcd     abcd     abij     abij     aeim     aeim
efgh ==> efgh ==> efmn ==> nefm ==> nbfj ==> bfjn
ijkl     klij     klcd     klcd     kocg     cgko
mnop     opmn     opgh     hopg     hlpd     dhlp


The bit-rearranging part of the program of Figure 7-4 requires 480 
instructions (80 swaps at six instructions each). The revised program, 
using rotate instructions, requires 80 swaps at four instructions each, 
plus 80 rotate instructions (16 · 5) for the first five stages, plus a final
31 rotate instructions, for a total of 431 instructions. The prolog and 
epilog code would be unchanged, so using rotate instructions in this 
way saves 49 instructions. 
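The rotate-based scheme can be sketched in loop form, reusing the loop structure of the compact function of Figure 7-3 rather than the straight-line form counted above. The names `rotr32`, `rotl32`, and `transpose32_rot` are assumptions, and the rotates are written with shifts because C has no rotate operator.

```c
#include <stdint.h>

static uint32_t rotr32(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* In-place 32x32 transpose using rotates and shift-free swaps. */
void transpose32_rot(uint32_t A[32]) {
    uint32_t m = 0x0000FFFFu, t;
    int j, k;

    for (j = 16; j != 0; j >>= 1, m ^= m << j) {
        for (k = 0; k < 32; k = (k + j + 1) & ~j) {
            A[k + j] = rotr32(A[k + j], (unsigned)j); /* align the fields */
            t = (A[k] ^ A[k + j]) & m;                /* swap, no shift   */
            A[k]     ^= t;
            A[k + j] ^= t;
        }
    }
    /* After five stages word k is the transposed row rotated right by k. */
    for (k = 1; k < 32; k++)
        A[k] = rotl32(A[k], (unsigned)k);
}
```

Each rotate lines up the field of word k + j with the field of word k that it is exchanged with, which is why the swap itself needs only an exclusive or, an and, and two more exclusive or's.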


There is another quite different method of transposing a bit matrix: 
apply three shearing transformations [GLS1]. If the matrix is nxn, the 
steps are (1) rotate row i to the right i bit positions, (2) rotate column 
j upwards (j + 1) mod n bit positions, (3) rotate row i to the right (i 
+ 1) mod n bit positions, and (4) reflect the matrix about a horizontal 
axis through the midpoint. To illustrate, for a 4x4-bit matrix: 


abcd     abcd     hlpd     dhlp     aeim
efgh ==> hefg ==> kocg ==> cgko ==> bfjn
ijkl     klij     nbfj     bfjn     cgko
mnop     nopm     aeim     aeim     dhlp


This method is not quite competitive with the others because step (2) 
is costly. (To do it at reasonable cost, rotate upwards all columns that 
rotate by n/2 or more bit positions by n/2 bit positions [these are 
columns n/2 - 1 through n - 2], then rotate certain columns upwards 
n/4 bit positions, and so on.) Steps 1 and 3 require only n - 1 
instructions each, and step 4 requires no instructions at all if the 
results are simply stored to the appropriate locations. 
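The four steps can be checked with a direct, unoptimized sketch for n = 32. Step (2) is done one bit at a time here purely for clarity, not with the logarithmic column rotation described above; the names `rotr32s` and `transpose32_shear` are assumptions.

```c
#include <stdint.h>

static uint32_t rotr32s(uint32_t x, unsigned n) {
    n &= 31;
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Transpose A into B (distinct arrays) by three shears and a reflection.
   Column 0 is the leftmost (most significant) bit of each word. */
void transpose32_shear(const uint32_t A[32], uint32_t B[32]) {
    uint32_t s1[32], s2[32];
    int r, c;

    /* (1) Rotate row r to the right r positions. */
    for (r = 0; r < 32; r++)
        s1[r] = rotr32s(A[r], (unsigned)r);

    /* (2) Rotate column c upward (c + 1) mod 32 positions. */
    for (r = 0; r < 32; r++)
        s2[r] = 0;
    for (c = 0; c < 32; c++) {
        uint32_t bit = 0x80000000u >> c;
        int up = (c + 1) & 31;
        for (r = 0; r < 32; r++)
            s2[r] |= s1[(r + up) & 31] & bit;
    }

    /* (3) Rotate row r right (r + 1) mod 32 positions, and
       (4) reflect about the horizontal midline while storing. */
    for (r = 0; r < 32; r++)
        B[31 - r] = rotr32s(s2[r], (unsigned)((r + 1) & 31));
}
```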


If an 8x8-bit matrix is stored in a 64-bit word in the obvious way (top 
row in the most significant eight bits, and so on), then the matrix 
transpose operation is equivalent to three outer perfect shuffles or 
unshuffles [GLS1]. This is a very good way to do it if your machine 
has shuffle or unshuffle as a single instruction, but it is not a good 
method on a basic RISC machine. 
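To see the equivalence concretely, here is a sketch that extends the exclusive-or shuffle sequence of Section 7-2 to 64 bits (one extra step for the 16-bit quartiles) and applies it three times. The names `outer_shuffle64` and `transpose8x8` are assumptions for illustration.

```c
#include <stdint.h>

/* Outer perfect shuffle of a 64-bit word. */
uint64_t outer_shuffle64(uint64_t x) {
    uint64_t t;
    t = (x ^ (x >> 16)) & 0x00000000FFFF0000ULL;  x = x ^ t ^ (t << 16);
    t = (x ^ (x >> 8))  & 0x0000FF000000FF00ULL;  x = x ^ t ^ (t << 8);
    t = (x ^ (x >> 4))  & 0x00F000F000F000F0ULL;  x = x ^ t ^ (t << 4);
    t = (x ^ (x >> 2))  & 0x0C0C0C0C0C0C0C0CULL;  x = x ^ t ^ (t << 2);
    t = (x ^ (x >> 1))  & 0x2222222222222222ULL;  x = x ^ t ^ (t << 1);
    return x;
}

/* 8x8-bit matrix, top row in the most significant byte. */
uint64_t transpose8x8(uint64_t x) {
    return outer_shuffle64(outer_shuffle64(outer_shuffle64(x)));
}
```

One outer shuffle rotates the 6-bit position number of every bit (counted from the left) left by one place; three of them rotate it by three places, which exchanges the 3-bit row number with the 3-bit column number.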


7-4 Compress, or Generalized Extract


The APL language includes an operation called compress, written B/V, 
where B is a Boolean vector and V is a vector of the same length as B,
with arbitrary elements. The result of the operation is a vector 
consisting of the elements of V for which the corresponding bit in B is 
1. The length of the result vector is equal to the number of 1's in B. 


Here we consider a similar operation on the bits of a word. Given a 
mask m and a word x, the bits of x for which the corresponding mask 
bit is 1 are selected and moved ("compressed") to the right. For 
example, if the word to be compressed is (where each letter denotes a 
single bit) 


abcd efgh ijkl mnop qrst uvwx yzAB CDEF, 
and the mask is 
0000 1111 0011 0011 1010 1010 0101 0101, 
then the result is 
0000 0000 0000 0000 efgh klop qsuw zBDF.


This operation might also be called generalized extract, by analogy 
with the extract instruction found on many computers. 


We are interested in code for this operation with minimum worst-case 
execution time, and offer the simple loop of Figure 7-5 as a straw man 
to be improved upon. This code has no branches in the loop, and it 
executes in 260 instructions worst case, including the subroutine 
prolog and epilog. 


Figure 7-5 A simple loop for the compress operation. 


unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}


It is possible to improve on this by repeatedly using the parallel prefix 
method (see page 75) with the exclusive or operation [GLS1]. We will 
denote the parallel prefix operation by PP-XOR. The basic idea is to 
first identify the bits of argument x that are to be moved right an odd 
number of bit positions, and move those. (This operation is simplified 
if x is first anded with the mask, to clear out irrelevant bits.) Mask bits 
are moved in the same way. Next, we identify the bits of x that are to 
be moved an odd multiple of 2 positions (2, 6, 10, and so on), and we 
then move these bits of x and the mask. Next, we identify and move 
the bits that are to be moved an odd multiple of 4 positions, then 
those that move an odd multiple of 8, and then those that move 16 bit 
positions. 


Because this algorithm, believed to be original with [GLS1], is a bit 
difficult to understand, and because it is perhaps surprising that 
something along these lines can be done at all, we will describe its 
operation in some detail. Suppose the inputs are 


x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF, 
m = 1000 1000 1110 0000 0000 1111 0101 0101, 
    1    1    111
    9    6    333            4444  3 2  1 0


where each letter in x represents a single bit (with value 0 or 1). The 
numbers below each 1-bit in the mask m denote how far the 
corresponding bit of x must move to the right. This is the number of
0's in m to the right of the bit. As mentioned above, it is convenient to
first clear out the irrelevant bits of x, giving


x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.


The plan is to first determine which bits move an odd number of 
positions (to the right), and move those one bit position. Recall that 
the PP-XOR operation results in a 1-bit at each position where the 
number of 1's at and to the right of that position is odd. We wish to 
identify those bits for which the number of 0's strictly to the right is 
odd. This can be done by computing mk = ~m << 1 and performing 
PP-XOR on the result. This gives 


mk = 1110 1110 0011 1111 1110 0001 0101 0100,
mp = 1010 0101 1110 1010 1010 0000 1100 1100.


Observe that mk identifies the bits of m that have a 0 immediately to 
the right, and mp sums these, modulo 2, from the right. Thus, mp 
identifies the bits of m that have an odd number of 0's to the right. 


The bits that will be moved one position are those that are in positions
that have an odd number of 0's strictly to the right (identified by mp)
and that have a 1-bit in the original mask. This is simply mv = mp & m:


mv = 1000 0000 1110 0000 0000 0000 0100 0100.


These bits of m may be moved with the assignment


m = (m ^ mv) | (mv >> 1);


and the same bits of x may be moved with the two assignments


t = x & mv;
x = (x ^ t) | (t >> 1);


(Moving the bits of m is simpler because all the selected bits are 1's.) 
Here the exclusive or is turning off bits known to be 1 in m and x,
and the or is turning on bits known to be 0 in m and x. The
operations could also, alternatively, both be exclusive or, or subtract 
and add, respectively. The results, after moving the bits selected by 
mv right one position, are: 


m = 0100 1000 0111 0000 0000 1111 0011 0011, 


x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.
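The first iteration can be reproduced numerically. In the sketch below the helper names `pp_xor` and `move_odd_bits` are hypothetical, and the hexadecimal constants in the comments are the binary values shown above (m = 0x88E00F55, mk = 0xEE3FE154).

```c
#include <stdint.h>

/* Parallel prefix with exclusive or: bit i of the result is the
   exclusive or of bits 0 through i of v. */
uint32_t pp_xor(uint32_t v) {
    v ^= v << 1;
    v ^= v << 2;
    v ^= v << 4;
    v ^= v << 8;
    v ^= v << 16;
    return v;
}

/* First iteration on the mask only: find the 1-bits of m that have an
   odd number of 0's to their right and move them right one position.
   Stores the move mask in *mv_out and returns the revised m. */
uint32_t move_odd_bits(uint32_t m, uint32_t *mv_out) {
    uint32_t mk = ~m << 1;        /* 0xEE3FE154 for the example m */
    uint32_t mp = pp_xor(mk);     /* 0xA5EAA0CC                   */
    uint32_t mv = mp & m;         /* 0x80E00044                   */
    *mv_out = mv;
    return (m ^ mv) | (mv >> 1);  /* 0x48700F33                   */
}
```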


Now we must prepare a mask for the second iteration, in which we 
identify bits that are to move an odd multiple of 2 positions to the 
right. Notice that the quantity mk & ~mp identifies those bits that 
have a 0 immediately to the right in the original mask m, and those 
bits that have an even number of 0's to the right in the original mask. 
These properties apply jointly, although not individually, to the 
revised mask m. (That is to say, the new mk identifies all the positions
in the revised mask m that have a 0 to the immediate right and an even
number of 0's to the right.) This is the quantity that, if summed from
the right with PP-XOR, identifies those bits that move to the right an 
odd multiple of 2 positions (2, 6, 10, and so on). Therefore, the 
procedure is to assign this quantity to mk and perform a second 
iteration of the above steps. The revised value of mk is 


mk = 0100 1010 0001 0101 0100 0001 0001 0000. 


A complete C function for this operation is shown in Figure 7-6. It 
does the job in 127 basic RISC instructions (constant), including the 
subroutine prolog and epilog. Figure 7-7 shows the sequence of values 
taken on by certain variables at key points in the computation, with 
the same inputs that were used in the discussion above. Observe that a 
by-product of the algorithm, in the last value assigned to m, is the 
original m with all its 1-bits compressed to the right. 


Figure 7-6 Parallel prefix method for the compress operation.


unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;                 // Clear irrelevant bits.
   mk = ~m << 1;              // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);             // Parallel prefix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                     // Bits to move.
      m = m ^ mv | (mv >> (1 << i));   // Compress m.
      t = x & mv;
      x = x ^ t | (t >> (1 << i));     // Compress x.
      mk = mk & ~mp;
   }
   return x;
}


Figure 7-7 Operation of the parallel prefix method for the compress
operation.


             x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
             m = 1000 1000 1110 0000 0000 1111 0101 0101
             x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

i = 0,      mk = 1110 1110 0011 1111 1110 0001 0101 0100
After PP,   mp = 1010 0101 1110 1010 1010 0000 1100 1100
            mv = 1000 0000 1110 0000 0000 0000 0100 0100
             m = 0100 1000 0111 0000 0000 1111 0011 0011
             x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

i = 1,      mk = 0100 1010 0001 0101 0100 0001 0001 0000
After PP,   mp = 1100 0110 0000 1100 1100 0000 1111 0000
            mv = 0100 0000 0000 0000 0000 0000 0011 0000
             m = 0001 1000 0111 0000 0000 1111 0000 1111
             x = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

i = 2,      mk = 0000 1000 0001 0001 0000 0001 0000 0000
After PP,   mp = 0000 0111 1111 0000 1111 1111 0000 0000
            mv = 0000 0000 0111 0000 0000 1111 0000 0000
             m = 0001 1000 0000 0111 0000 0000 1111 1111
             x = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

i = 3,      mk = 0000 1000 0000 0001 0000 0000 0000 0000
After PP,   mp = 0000 0111 1111 1111 0000 0000 0000 0000
            mv = 0000 0000 0000 0111 0000 0000 0000 0000
             m = 0001 1000 0000 0000 0000 0111 1111 1111
             x = 000a e000 0000 0000 0000 0ijk uvwx zBDF

i = 4,      mk = 0000 1000 0000 0000 0000 0000 0000 0000
After PP,   mp = 1111 1000 0000 0000 0000 0000 0000 0000
            mv = 0001 1000 0000 0000 0000 0000 0000 0000
             m = 0000 0000 0000 0000 0001 1111 1111 1111
             x = 0000 0000 0000 0000 000a eijk uvwx zBDF


We calculate that the algorithm of Figure 7-6 would execute in 169
instructions on a 64-bit basic RISC, as compared to 516 (worst case)
for the algorithm of Figure 7-5.


The number of instructions required by the algorithm of Figure 7-6 
can be reduced substantially if the mask m is a constant. This can 
occur in two situations: (1) a call to "compress (x, m)" occurs in a 
loop, in which the value of m is not known but it is a loop constant, 
and (2) the value of m is known and the code for compress is
generated in advance, perhaps by a compiler. 


Notice that the value assigned to x in the loop in Figure 7-6 is not 
used in the loop for anything other than the assignment to x. And, x 
is dependent only on itself and variable mv. Therefore, the subroutine 
can be coded with all references to x deleted, and the five values 
computed for mv can be saved in variables mv0, mv1,..., mv4. 
Then, in situation (1) the function without references to x can be 
placed outside the loop in which "compress (x, m)" occurs, and the 
following statements can be placed in the loop: 


x = x & m;
t = x & mv0; x = x ^ t | (t >> 1);
t = x & mv1; x = x ^ t | (t >> 2);
t = x & mv2; x = x ^ t | (t >> 4);
t = x & mv3; x = x ^ t | (t >> 8);
t = x & mv4; x = x ^ t | (t >> 16);


This is only 21 instructions in the loop (the loading of the constants
can be placed outside the loop), a considerable improvement over the
127 required by the full subroutine of Figure 7-6.
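

The split into a precomputation step and a per-word step can be sketched as follows. This is only an illustration under stated assumptions: the names compress_masks and compress_with_masks are ours, the five masks are kept in an array rather than in variables mv0 through mv4, and a 32-bit word is assumed.

```c
#include <stdint.h>

/* Precompute the five move masks for a loop-constant mask m. This is the
   loop of Figure 7-6 with all references to x deleted. (Hypothetical
   helper name.) */
void compress_masks(uint32_t m, uint32_t mv[5]) {
    uint32_t mk = ~m << 1;                 /* We will count 0's to right. */
    for (int i = 0; i < 5; i++) {
        uint32_t mp = mk ^ (mk << 1);      /* Parallel prefix XOR. */
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        mv[i] = mp & m;                    /* Bits to move in this round. */
        m = (m ^ mv[i]) | (mv[i] >> (1 << i));
        mk &= ~mp;
    }
}

/* Per-word work: one AND, then four operations for each mask. */
uint32_t compress_with_masks(uint32_t x, uint32_t m, const uint32_t mv[5]) {
    x &= m;
    for (int i = 0; i < 5; i++) {
        uint32_t t = x & mv[i];
        x = (x ^ t) | (t >> (1 << i));
    }
    return x;
}
```

In situation (1), compress_masks would be called once outside the loop and compress_with_masks (or its unrolled equivalent) once per word inside it.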


In situation (2), in which the value of m is known, the same sort of 
thing can be done, and further optimization may be possible. It might 
happen that one of the five masks is 0, in which case one of the five 
lines shown above can be omitted. For example, mask mv0 is 0 if it
happens that no bit moves an odd number of positions, and mv4 is 0 if
no bit moves more than 15 positions, and so on.


As an example, for 
m = 0101 0101 0101 0101 0101 0101 0101 0101, 
the calculated masks are 


mv0 = 0100 0100 0100 0100 0100 0100 0100 0100
mv1 = 0011 0000 0011 0000 0011 0000 0011 0000
mv2 = 0000 1111 0000 0000 0000 1111 0000 0000
mv3 = 0000 0000 1111 1111 0000 0000 0000 0000
mv4 = 0000 0000 0000 0000 0000 0000 0000 0000


Because the last mask is 0, in the compiled code situation this 
compression operation is done in 17 instructions (not counting the 
loading of the masks). This is not quite as good as the code shown for 
this operation on page 108 (13 instructions, not counting the loading 
of masks), which takes advantage of the fact that alternate bits are 
being selected. 


Using Insert and Extract 


If your computer has the insert instruction, preferably with immediate 
values for the operands that identify the bit field in the target register, 
then in the compiled situation insert can often be used to do the 
compress operation with fewer instructions than the methods 
discussed above. Furthermore, it doesn't tie up registers holding the 
masks. 


The target register is initialized to 0, and then, for each contiguous 
group of 1's in the mask m, variable x is shifted right to right-justify 
the next field, and the insert instruction is used to insert the bits of x 
in the appropriate place in the target register. This does the operation 
in 2n + 1 instructions, where n is the number of fields (groups of 
consecutive 1's) in the mask. The worst case is 33 instructions, 
because the maximum number of fields is 16 (which occurs for 
alternating 1's and 0's). 


An example in which the insert method uses substantially fewer 
instructions is m = 0x0010084A. Compressing with this mask 
requires moving bits 1, 2, 4, 8, and 16 positions. Thus, it takes the full 
21 instructions for the parallel prefix method, but only 11 instructions 
for the insert method (there are five fields). A more extreme case is m 
= 0x80000000. Here a single bit moves 31 positions, requiring 21 
instructions for the parallel prefix method, but only three instructions 
for the insert method and only one instruction (shift right 31) if you 
are not constrained to any particular scheme. 
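

The insert method can be imitated in portable C, although C has no insert instruction. In the sketch below, which is ours and not code from the text, each pass of the loop stands for one "shift right, then insert" pair, and count_fields gives the n in the 2n + 1 estimate. Both function names are assumptions made for illustration.

```c
#include <stdint.h>

/* Number of fields (groups of consecutive 1's) in m. A field starts at
   each 1-bit whose right-hand neighbor is 0 (or that is bit 0). */
int count_fields(uint32_t m) {
    uint32_t starts = m & ~(m << 1);
    int n = 0;
    while (starts != 0) {
        starts &= starts - 1;          /* Clear the lowest 1-bit. */
        n++;
    }
    return n;
}

/* Compress by fields: for each field of the mask, right-justify the
   corresponding bits of x and place them at the next free position of
   the result. */
uint32_t compress_by_fields(uint32_t x, uint32_t m) {
    uint32_t result = 0;               /* Target initialized to 0. */
    int out = 0;                       /* Next free bit in the result. */
    int bit = 0;
    while (bit < 32) {
        if (((m >> bit) & 1u) == 0) { bit++; continue; }
        int len = 0;                   /* Length of this field. */
        while (bit + len < 32 && ((m >> (bit + len)) & 1u)) len++;
        uint32_t field_mask = (len == 32) ? 0xFFFFFFFFu : ((1u << len) - 1u);
        uint32_t field = (x >> bit) & field_mask;   /* Shift right. */
        result |= field << out;                     /* "Insert". */
        out += len;
        bit += len;
    }
    return result;
}
```

For a mask known at compile time, the loop would of course be unrolled into straight-line code with the field positions as immediate values.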


You can also use the extract instruction in various simple ways to do
the compress operation with a known mask in 3n - 2 instructions,
where n is the number of fields in the mask.


Clearly, the problem of compiling optimal code for the compress 
operation with a known mask is a difficult one. 


Compress Left 


To compress bits to the left, obviously you can reverse the argument
x and the mask, compress right, and reverse the result. Another way is
to compress right and then shift left by pop(~m), the number of 0-bits
in the mask. These might be
satisfactory if your computer has an instruction for bit reversal or
population count, but if not, the algorithm of Figure 7-6 is easily
adapted: Just reverse the direction of all the shifts except the two in
the expressions 1 << i (eight to change).
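

The second approach (compress right, then shift left by the number of 0-bits in the mask, that is, by pop(~m) = 32 - pop(m)) can be sketched as below. The helper names are ours; compress_ref is a deliberately simple bit-at-a-time loop used only to keep the example self-contained, not the parallel prefix routine.

```c
#include <stdint.h>

/* Bit-at-a-time compress right, used here only as a simple reference. */
uint32_t compress_ref(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int out = 0;
    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << out;
            out++;
        }
    }
    return r;
}

/* Population count by repeatedly clearing the lowest 1-bit. */
int pop(uint32_t v) {
    int n = 0;
    while (v != 0) { v &= v - 1; n++; }
    return n;
}

uint32_t compress_left_by_shift(uint32_t x, uint32_t m) {
    int s = pop(~m);             /* Number of unselected positions. */
    if (s == 32) return 0;       /* m == 0; avoid an undefined 32-bit shift. */
    return compress_ref(x, m) << s;
}
```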


7-5 General Permutations, Sheep and Goats Operation 


To do general permutations of the bits in a word, or of anything else,
a central problem is how to represent the permutation. It cannot be
represented very compactly. Because there are 32! permutations of the
bits in a 32-bit word, at least $\lceil \log_2(32!) \rceil = 118$ bits, or three
words plus 22 bits, are required to designate one permutation out of
the 32!.


One interesting way to represent permutations is closely related to the 
compression operations discussed in Section 7-4 [GLS1]. Start with the 
direct method of simply listing the bit position to which each bit 
moves. For example, for the permutation done by a rotate left of four 
bit positions, the bit at position 0 (the least significant bit) moves to 
position 4, 1 moves to 5, ..., 31 moves to 3. This permutation can be 
represented by the vector of 32 5-bit indexes: 


00100
00101
...
11111
00000
00001
00010
00011


Treating that as a bit matrix, the representation we have in mind is its 
transpose, except reflected about the off diagonal so the top row 
contains the least significant bits and the result uses little-endian bit 
numbering. This we store as five 32-bit words in array p: 


p[0] = 1010 1010 1010 1010 1010 1010 1010 1010
p[1] = 1100 1100 1100 1100 1100 1100 1100 1100
p[2] = 0000 1111 0000 1111 0000 1111 0000 1111
p[3] = 0000 1111 1111 0000 0000 1111 1111 0000
p[4] = 0000 1111 1111 1111 1111 0000 0000 0000


Each bit of p[0] is the least significant bit of the position to which 
the corresponding bit of x moves, each bit of p[1] is the next more 
significant bit, and so on. This is similar to the encoding of the masks 
denoted by mv in the previous section, except that mv applies to 
revised masks in the compress algorithm, not to the original mask. 


The compression operation we need compresses to the left all bits
marked with 1's in the mask, and compresses to the right all bits
marked with 0's. [1] This is sometimes called the "sheep and goats"
operation (SAG), or "generalized unshuffle." It can be calculated with


SAG(x, m) = compress_left(x, m) | compress(x, ~m).


[1] If big-endian bit numbering is used, compress to the left all bits
marked with 0's, and to the right all bits marked with 1's.
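

A direct transcription of this formula into C might look like the sketch below. The two helpers are simple bit-at-a-time loops written only so that the example stands alone (the names compress_ref and compress_left_ref are ours); a production version would use the parallel prefix method or a hardware instruction.

```c
#include <stdint.h>

/* Compress the bits of x selected by m to the right, preserving order. */
uint32_t compress_ref(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int out = 0;
    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << out;
            out++;
        }
    }
    return r;
}

/* Compress the bits of x selected by m to the left, preserving order. */
uint32_t compress_left_ref(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int out = 31;
    for (int i = 31; i >= 0; i--) {
        if ((m >> i) & 1u) {
            r |= ((x >> i) & 1u) << out;
            out--;
        }
    }
    return r;
}

/* Sheep and goats: bits marked 1 go left, bits marked 0 go right. */
uint32_t sag(uint32_t x, uint32_t m) {
    return compress_left_ref(x, m) | compress_ref(x, ~m);
}
```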


With SAG as a fundamental operation, and a permutation p described 
as above, the bits of a word x can be permuted by p in the following 
15 steps: 


x = SAG(x, p[0]);
p[1] = SAG(p[1], p[0]);
p[2] = SAG(p[2], p[0]);
p[3] = SAG(p[3], p[0]);
p[4] = SAG(p[4], p[0]);
x = SAG(x, p[1]);
p[2] = SAG(p[2], p[1]);
p[3] = SAG(p[3], p[1]);
p[4] = SAG(p[4], p[1]);
x = SAG(x, p[2]);
p[3] = SAG(p[3], p[2]);
p[4] = SAG(p[4], p[2]);
x = SAG(x, p[3]);
p[4] = SAG(p[4], p[3]);
x = SAG(x, p[4]);


In these steps, SAG is used to perform a stable binary radix sort. Array 
p is used as 32 5-bit keys to sort the bits of x. In the first step, all bits 
of x for which p[0] = 1 are moved to the left half of the resulting 
word, and all those for which p[0] = 0 are moved to the right half. 
Other than this, the order of the bits is not changed (that is, the sort is 
"stable"). Then all the keys that will be used for the next round of 
sorting are similarly sorted. The sixth line is sorting x based on the 
second least significant bit of the key, and so on. 
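

The 15 steps can be written as a pair of nested loops. The sketch below is ours: sag is a plain bit-at-a-time loop rather than a fast implementation, sag_permute is a hypothetical name, and the permutation array is copied so that the caller's p is left unchanged.

```c
#include <stdint.h>

/* Sheep and goats, one bit at a time: bits of x whose mask bit is 1 are
   packed at the left end, the others at the right end, and the relative
   order within each group is preserved (the sort is stable). */
uint32_t sag(uint32_t x, uint32_t m) {
    uint32_t left = 0, right = 0;
    int nl = 0, nr = 0;
    for (int i = 0; i < 32; i++) {
        uint32_t b = (x >> i) & 1u;
        if ((m >> i) & 1u) { left |= b << nl; nl++; }
        else               { right |= b << nr; nr++; }
    }
    /* nr == 32 means no bit was marked 1, so left is 0 anyway. */
    return (nr < 32 ? left << nr : 0u) | right;
}

/* Permute x by the "goes to" description p[0..4], least significant
   key bit first. The inner loop sorts the remaining key bits along
   with x, for 5 + 4 + 3 + 2 + 1 = 15 SAG operations in all. */
uint32_t sag_permute(uint32_t x, const uint32_t p_in[5]) {
    uint32_t p[5];
    for (int k = 0; k < 5; k++) p[k] = p_in[k];
    for (int k = 0; k < 5; k++) {
        x = sag(x, p[k]);
        for (int j = k + 1; j < 5; j++)
            p[j] = sag(p[j], p[k]);
    }
    return x;
}
```

With the array p given earlier for a rotate left of four positions, sag_permute should agree with an ordinary 4-bit left rotation.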


Similarly to the situation of compressing, if a certain permutation p is 
to be used on a number of words x, then a considerable savings 
results by precomputing most of the steps above. The permutation 
array is revised to 


p[1] = SAG(p[1], p[0]);
p[2] = SAG(SAG(p[2], p[0]), p[1]);
p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);
p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);


and then each permutation is done with 
x = SAG(x, p[0]); 
x = SAG(x, p[1]); 
x = SAG(x, p[2]); 


x = SAG(x, p[3]); 


x = SAG(x, p[4]); 


A more direct (but perhaps less interesting) way to do general
permutations of the bits in a word is to represent a permutation as a
sequence of 32 5-bit indexes.


The kth index is the bit number in the source from which the kth bit 
of the result comes. (This is a "comes from" list, whereas the SAG 
method uses a "goes to" list.) These could be packed six to a 32-bit 
word, thus requiring six words to hold all 32 bit indexes. An 
instruction can be implemented in hardware such as 


bitgather Rt,Rx,Ri, 


where register Rt is a target register (and also a source), register Rx 
contains the bits to be permuted, and register Ri contains six 5-bit 
indexes (and two unused bits). The operation of the instruction is 


$t \leftarrow (t \ll 6) \mid x_{i_0} x_{i_1} x_{i_2} x_{i_3} x_{i_4} x_{i_5}$


In words, the contents of the target register are shifted left six bit 
positions, and six bits are selected from word x and placed in the 
vacated six positions of t. The bits selected are given by the six 5-bit 
indexes in word i, taken in left-to-right order. The bit numbering in 
the indexes could be either little- or big-endian, and the operation 
would probably be as described for either type of machine. 


To permute a word, use a sequence of six such instructions, all with
the same Rt and Rx, but different index registers. In the first index
register of the sequence, only indexes $i_4$ and $i_5$ are significant, as the
bits selected by the other four indexes are shifted out of the left end of
Rt.


An implementation of this instruction would most likely allow index 
values to be repeated, so the instruction can be used to do more than 
permute bits. It can be used to repeat any selected bit any number of 
times in the target register. The SAG operation lacks this generality. 
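

Since bitgather is a proposed instruction, it can only be emulated in C. The sketch below makes several assumptions that the text leaves open: the six 5-bit indexes occupy bits 29..0 of the index word with $i_0$ leftmost, the two unused bits are the top two, and bit numbering in the indexes is little-endian. The function names are ours.

```c
#include <stdint.h>

/* Emulation of one bitgather: shift t left six positions and fill the
   vacated positions with the six bits of x selected by the indexes,
   taken in left-to-right order. */
uint32_t bitgather(uint32_t t, uint32_t x, uint32_t idx) {
    for (int k = 0; k < 6; k++) {
        uint32_t i = (idx >> (25 - 5 * k)) & 31u;   /* i0 first. */
        t = (t << 1) | ((x >> i) & 1u);
    }
    return t;
}

/* Full 32-bit mapping from a "comes from" list: bit k of the result is
   bit src[k] of x. Six bitgather steps supply 36 slots; the first four
   are padding and are shifted out of the left end of t. */
uint32_t bitgather_permute(uint32_t x, const uint8_t src[32]) {
    uint32_t t = 0;
    int k = 35;                        /* Slots 35..32 are padding. */
    for (int w = 0; w < 6; w++) {
        uint32_t idx = 0;
        for (int j = 0; j < 6; j++, k--) {
            uint32_t i = (k < 32) ? src[k] : 0u;
            idx = (idx << 5) | (i & 31u);
        }
        t = bitgather(t, x, idx);
    }
    return t;
}
```

Because indexes may repeat, the same routine can replicate a bit; for example, setting every entry of src to 31 propagates the sign bit to all positions.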


It is not unduly difficult to implement this as a fast (e.g., one cycle)
instruction. The bit selection circuit consists of six 32:1 MUX's. If these
are built from five stages of 2:1 MUX's in today's technology (6 · 31 =
186 MUX's in all), the instruction would be faster than a 32-bit add
instruction [MD].


Permuting bits has applications in cryptography, and the closely
related operation of permuting subwords (e.g., permuting the bytes in
a word) has applications in computer graphics. Both of these
applications are more likely to deal with 64-bit words, or possibly
with 128, than with 32. The SAG and bitgather methods apply with
obvious changes to these larger word sizes.


To encrypt or decrypt a message with the Data Encryption Standard 
(DES) algorithm requires a large number of permutation-like 
mappings. First, key generation is done, once per session. This 
involves 17 permutation-like mappings. The first, called "permuted 
choice 1," maps from a 64-bit quantity to a 56-bit quantity (it selects 
the 56 non-parity bits from the key and permutes them). This is 
followed by 16 permutation-like mappings from 56 bits to 48 bits, all 
using the same mapping, called "permuted choice 2." 


Following key generation, each block of 64 bits in the message is 
subjected to 34 permutation-like operations. The first and last 
operations are 64-bit permutations, one being the inverse of the other. 
There are 16 permutations with repetitions that map 32-bit quantities 
to 48 bits, all using the same mapping. Finally, there are 16 32-bit 
permutations, all using the same permutation. The total number of 
distinct mappings is six. They are all constants and are given in [DES]. 


DES is obsolete, as it was proved to be insecure in 1998 by the 
Electronic Frontier Foundation, using special hardware. The National 
Institute of Standards and Technology (NIST) has endorsed a 
temporary replacement called Triple DES, which consists of DES run 
serially three times on each 64-bit block, each time with a different 
key (that is, the key length is 192 bits, including 24 parity bits). Hence 
it takes three times as many permutation operations as does DES to 
encrypt or decrypt. 


However, the "permanent" replacement for DES and Triple DES, the 
Advanced Encryption Standard (previously known as the Rijndael 
algorithm [AES]), involves no bit-level permutations. The closest it 
comes to a permutation is a simple rotation of 32-bit words by a 
multiple of 8-bit positions. Other encryption methods proposed or in 
use generally involve far fewer bit-level permutations than DES. 


To compare the two permutation methods discussed here, the
bitgather method has the advantages of (1) simpler preparation of the
index words from the raw data describing the permutation, (2)
simpler hardware, and (3) more general mappings. The SAG method
has the advantages of (1) doing the permutation in five rather than six
instructions, (2) having only two source registers in its instruction
format (which might fit better in some RISC architectures), (3) scaling
better to permute a doubleword quantity, and (4) permuting subwords
more efficiently.


Item (3) is discussed in [LSY]. The SAG instruction allows for doing a 
general permutation of a two-word quantity with two executions of 
the SAG instruction, a few basic RISC instructions, and two full 
permutations of single words. The bitgather instruction allows for 
doing it by executing three full permutations of single words plus a 
few basic RISC instructions. This does not count preprocessing of the 
permutation to produce new quantities that depend only on the 
permutation. We leave it to the reader to discover these methods. 


Regarding item (4), to permute, for example, the four bytes of a word
with bitgather requires executing six instructions, the same as for a
general bit permutation by bitgather. But with SAG it can be done in
only two instructions, rather than the five required for a general bit
permutation by SAG. The gain in efficiency applies even when the
subwords are not a power of 2 in size; the number of steps required is
$\lceil \log_2 n \rceil$, where n is the number of subwords, not counting a
possible non-participating group of bits that stays at one end or the
other.


[LSY] discusses the SAG and bitgather instructions (called "GRP" and 
"PPERM," respectively), other possible permutation instructions based 
on networks, and permuting by table lookup. 


7-6 Rearrangements and Index Transformations 


Many simple rearrangements of the bits in a computer word 
correspond to even simpler transformations of the coordinates, or 
indexes, of the bits [GLS1]. These correspondences apply to 
rearrangements of the elements of any one-dimensional array, 
provided the number of array elements is an integral power of 2. For 
programming purposes, they are useful primarily when the array 
elements are a computer word or larger in size. 


As an example, the outer perfect shuffle of the elements of an array A 
of size eight, with the result in array B, consists of the following 
moves: 


$A_0 \to B_0$; $A_1 \to B_2$; $A_2 \to B_4$; $A_3 \to B_6$;
$A_4 \to B_1$; $A_5 \to B_3$; $A_6 \to B_5$; $A_7 \to B_7$.


Each B-index is the corresponding A-index rotated left one position,
using a 3-bit rotator. The outer perfect unshuffle is of course
accomplished by rotating right each index. Some similar
correspondences are shown in Table 7-1. Here n is the number of
array elements, "lsb" means least significant bit, and the rotations of
indexes are done with a $\log_2 n$-bit rotator.
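

The correspondence between the outer perfect shuffle and a left rotation of the index can be checked with a few lines of C. This is a sketch under our own naming (rotl_index, outer_shuffle); it assumes the array length is 2^bits with 1 <= bits < 32.

```c
/* Rotate a bits-wide index left by one position. */
unsigned rotl_index(unsigned i, unsigned bits) {
    unsigned mask = (1u << bits) - 1u;
    return ((i << 1) | (i >> (bits - 1))) & mask;
}

/* Outer perfect shuffle of a[0 .. 2^bits - 1] into b: element i moves
   to the position obtained by rotating its index left by one. */
void outer_shuffle(const int *a, int *b, unsigned bits) {
    unsigned n = 1u << bits;
    for (unsigned i = 0; i < n; i++)
        b[rotl_index(i, bits)] = a[i];
}
```

For the eight-element example above, the result interleaves the two halves of A: the first half lands in the even positions of B and the second half in the odd positions.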


Table 7-1. Rearrangements and Index Transformations
[Table body illegible in this copy.]


Chapter 8. Multiplication 


Multiword Multiplication 
High-Order Half of 64-Bit Product 
High-Order Product Signed from/to Unsigned 


Multiplication by Constants 


8-1 Multiword Multiplication 


This may be done with, basically, the traditional grade-school method. 
Rather than develop an array of partial products, however, it is more 
efficient to add each new row, as it is being computed, into a row that 
will become the product. 


If the multiplicand is m words, and the multiplier is n words, then the 
product occupies m + n words (or fewer), whether signed or 
unsigned. 


In applying the grade-school scheme, we would like to treat each 32- 
bit word as a single digit. This works out well if an instruction that 
gives the 64-bit product of two 32-bit integers is available. 
Unfortunately, even if the machine has such an instruction, it is not 
readily accessible from most high-level languages. In fact, many 
modern RISC machines do not have this instruction in part because it 
isn't accessible from high-level languages and thus would not often be 
used. (Another reason is that the instruction would be one of a very 
few that give a two-register result.) 


Our procedure is shown in Figure 8-1. It uses halfwords as the "digits." 
Parameter w gets the result, and u and v are the multiplier and 
multiplicand, respectively. Each is an array of halfwords, with the first 
halfword (w[0], u[0], and v[0]) being the least significant digit. 
This is "little-endian" order. Parameters m and n are the number of 
halfwords in u and v, respectively. 


The picture below may help in understanding. There is no relation
between m and n; either may be the larger.


The procedure follows Algorithm M of [Knu2, sec. 4.3.1], but is coded
in C and modified to perform signed multiplication. Observe that the
assignment to t in the upper half of Figure 8-1 cannot overflow,
because the maximum value that could be assigned to t is
$(2^{16} - 1)^2 + 2(2^{16} - 1) = 2^{32} - 1$.


Multiword multiplication is simplest for unsigned operands. In fact, 
the code of Figure 8-1 performs unsigned multiplication if the 
"correction" steps (the lines between the three-line comment and the 
"return" statement) are omitted. An unsigned version can be extended 
to signed in three ways: 


1. Take the absolute value of each input operand, perform
unsigned multiplication, and then negate the result if the input
operands had different signs.


2. Perform the multiplication using unsigned elementary
multiplication except when multiplying one of the high-order
halfwords, in which case use signed x unsigned or signed x
signed multiplication.


3. Perform unsigned multiplication and then correct the result
somehow.


Figure 8-1 Multiword integer multiplication, signed.


void mulmns(unsigned short w[], unsigned short u[],
   unsigned short v[], int m, int n) {

   unsigned int k, t, b;
   int i, j;

   for (i = 0; i < m; i++)
      w[i] = 0;

   for (j = 0; j < n; j++) {
      k = 0;
      for (i = 0; i < m; i++) {
         t = u[i]*v[j] + w[i + j] + k;
         w[i + j] = t;          // (I.e., t & 0xFFFF).
         k = t >> 16;
      }
      w[j + m] = k;
   }

   // Now w[] has the unsigned product. Correct by
   // subtracting v*2**16m if u < 0, and
   // subtracting u*2**16n if v < 0.

   if ((short)u[m - 1] < 0) {
      b = 0;                    // Initialize borrow.
      for (j = 0; j < n; j++) {
         t = w[j + m] - v[j] - b;
         w[j + m] = t;
         b = t >> 31;
      }
   }
   if ((short)v[n - 1] < 0) {
      b = 0;
      for (i = 0; i < m; i++) {
         t = w[i + n] - u[i] - b;
         w[i + n] = t;
         b = t >> 31;
      }
   }
   return;
}


The first method requires passing over as many as m + n input 
halfwords, to compute their absolute value. Or, if one operand is 
positive and one is negative, the method requires passing over as 
many as max(m, n) + m + n halfwords, to complement the negative
input operand and the result. Perhaps more serious, the algorithm 
would alter its inputs (which we assume are passed by address), which 
may be unacceptable in some applications. Alternatively, it could 
allocate temporary space for them, or it could alter them and later 
change them back. All these alternatives are unappealing. 


The second method requires three kinds of elementary multiplication 
(unsigned x unsigned, unsigned x signed, and signed x signed) and 
requires sign extension of partial products on the left, with 0's or 1's, 
making each partial product take longer to compute and add to the 
running total. 


We choose the third method. To see how it works, let u and v denote
the values of the two signed integers being multiplied, and let them be
of lengths M and N bits, respectively. Then the steps in the upper half
of Figure 8-1 erroneously interpret u as an unsigned quantity, having
value $u + 2^M u_{M-1}$, where $u_{M-1}$ is the sign bit of u. That is,
$u_{M-1} = 1$ if u is negative, and $u_{M-1} = 0$ otherwise. Similarly, the
program interprets v as having value $v + 2^N v_{N-1}$.


The program computes the product of these unsigned numbers; that
is, it computes


$$(u + 2^M u_{M-1})(v + 2^N v_{N-1}) = uv + 2^M u_{M-1} v + 2^N v_{N-1} u + 2^{M+N} u_{M-1} v_{N-1}.$$


To get the desired result (uv), we must subtract from the unsigned
product the value $2^M u_{M-1} v + 2^N v_{N-1} u$. There is no need to subtract
the term $2^{M+N} u_{M-1} v_{N-1}$, because we know that the result can be
expressed in M + N bits, so there is no need to compute any product
bits more significant than bit position M + N - 1. These two
subtractions are performed by the steps below the three-line comment
in Figure 8-1. They require passing over a maximum of m + n
halfwords.
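

The correction is easiest to see in the smallest case, M = N = 16, where each operand is a single halfword and the product fits in 32 bits. The following sketch is ours (the function name is hypothetical); it forms the unsigned product and then subtracts $2^{16} v$ if u is negative and $2^{16} u$ if v is negative, all modulo $2^{32}$. The final conversion to int32_t assumes the usual two's-complement wraparound.

```c
#include <stdint.h>

/* Signed 16 x 16 ==> 32 multiplication done with an unsigned multiply
   followed by the two corrections described in the text. */
int32_t mul16_signed_via_unsigned(int16_t u, int16_t v) {
    uint32_t uu = (uint16_t)u;         /* u read as unsigned: u + 2^16*u15 */
    uint32_t vv = (uint16_t)v;         /* v read as unsigned: v + 2^16*v15 */
    uint32_t p = uu * vv;              /* Unsigned product, < 2^32. */
    if (u < 0) p -= vv << 16;          /* Subtract 2^16 * v (mod 2^32). */
    if (v < 0) p -= uu << 16;          /* Subtract 2^16 * u (mod 2^32). */
    return (int32_t)p;                 /* Assumes two's-complement wrap. */
}
```

The term $2^{32} u_{15} v_{15}$ never has to be subtracted, because it vanishes modulo $2^{32}$.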


It might be tempting to use the program of Figure 8-1 by passing it an
array of fullword integers, that is, by "lying across the interface."
Such a program will work on a little-endian machine, but not on a
big-endian one. If we had stored the arrays in the reverse order, with
u[0] being the most significant halfword (and the program altered
accordingly), the "lying" program would work on a big-endian
machine, but not on a little-endian one.


8-2 High-Order Half of 64-Bit Product 


Here we consider the problem of computing the high-order 32 bits of 
the product of two 32-bit integers. This is the function of our basic 
RISC instructions multiply high signed (mulhs) and multiply high 
unsigned (mulhu). 


For unsigned multiplication, the algorithm in the upper half of Figure 
8-1 works well. Rewrite it for the special case m = n = 2, with loops 
unrolled, obvious simplifications made, and the parameters changed 
to 32-bit unsigned integers. 


For signed multiplication, it is not necessary to code the "correction 
steps" in the lower half of Figure 8-1. These can be omitted if proper 
attention is paid to whether the intermediate results are signed or 
unsigned (declaring them to be signed causes the right shifts to be 
sign-propagating shifts). The resulting algorithm is shown in Figure 
8-2. For an unsigned version, simply change all the int declarations 
to unsigned. 


Figure 8-2 Multiply high signed.


int mulhs(int u, int v) {
   unsigned u0, v0, w0;
   int u1, v1, w1, w2, t;

   u0 = u & 0xFFFF; u1 = u >> 16;
   v0 = v & 0xFFFF; v1 = v >> 16;
   w0 = u0*v0;
   t = u1*v0 + (w0 >> 16);
   w1 = t & 0xFFFF;
   w2 = t >> 16;
   w1 = u0*v1 + w1;
   return u1*v1 + w2 + (w1 >> 16);
}


The algorithm requires 16 basic RISC instructions in either the signed 
or unsigned version, four of which are multiplications. 
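

The unsigned version mentioned above can be checked against a 64-bit multiplication where one is available. In the sketch below, mulhu is Figure 8-2 with every declaration made unsigned (written with uint32_t, which is our choice), and mulhu_ref is a reference we add purely for comparison.

```c
#include <stdint.h>

/* Multiply high unsigned: Figure 8-2 with all variables unsigned. */
uint32_t mulhu(uint32_t u, uint32_t v) {
    uint32_t u0, v0, w0, u1, v1, w1, w2, t;

    u0 = u & 0xFFFF; u1 = u >> 16;
    v0 = v & 0xFFFF; v1 = v >> 16;
    w0 = u0*v0;
    t = u1*v0 + (w0 >> 16);        /* Cannot overflow 32 bits. */
    w1 = t & 0xFFFF;
    w2 = t >> 16;
    w1 = u0*v1 + w1;               /* Cannot overflow 32 bits. */
    return u1*v1 + w2 + (w1 >> 16);
}

/* Reference: take the high half of the full 64-bit product. */
uint32_t mulhu_ref(uint32_t u, uint32_t v) {
    return (uint32_t)(((uint64_t)u * v) >> 32);
}
```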


8-3 High-Order Product Signed from/to Unsigned 


Assume that the machine can readily compute the high-order half of 
the 64-bit product of two unsigned 32-bit integers, but we wish to 
perform the corresponding operation on signed integers. We could use 
the procedure of Figure 8-2, but that requires four multiplications; the 
procedure to be given [BGN] is much more efficient than that. 


The analysis is a special case of that done to convert Knuth's
Algorithm M from an unsigned to a signed multiplication routine
(Figure 8-1). Let x and y denote the two 32-bit signed integers that we
wish to multiply together. The machine will interpret x as an unsigned
integer, having the value $x + 2^{32}x_{31}$, where $x_{31}$ is the most significant
bit of x (that is, $x_{31}$ is the integer 1 if x is negative, and 0 otherwise).
Similarly, y under unsigned interpretation has the value $y + 2^{32}y_{31}$.


Although the result we want is the high-order 32 bits of xy, the
machine computes


$$(x + 2^{32}x_{31})(y + 2^{32}y_{31}) = xy + 2^{32}(x_{31}y + y_{31}x) + 2^{64}x_{31}y_{31}.$$


To get the desired result, we must subtract from this the quantity
$2^{32}(x_{31}y + y_{31}x) + 2^{64}x_{31}y_{31}$. Because we know that the result can be
expressed in 64 bits, we can perform the arithmetic modulo $2^{64}$. This
means that we can safely ignore the last term, and compute the signed
high-order product as shown below (seven basic RISC instructions).


Equation 1

p ← mulhu(x, y)        // Multiply high unsigned instruction.
t1 ← (x >>s 31) & y    // t1 = x31*y (>>s is the signed right shift).
t2 ← (y >>s 31) & x    // t2 = y31*x.
p ← p - t1 - t2        // p = desired result.
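

In C, the signed-from-unsigned computation might be sketched as follows. Several details are our assumptions: mulhu is stood in for by a 64-bit multiplication, the sign mask is built as 0 - (ux >> 31) because a right shift of a negative signed value is implementation-defined in C, and the final conversion to int32_t relies on two's-complement wraparound. The name mulhs_from_mulhu is ours.

```c
#include <stdint.h>

/* Stand-in for a multiply high unsigned instruction. */
uint32_t mulhu(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * y) >> 32);
}

/* High-order 32 bits of the signed product, from the unsigned one. */
int32_t mulhs_from_mulhu(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t p  = mulhu(ux, uy);
    uint32_t t1 = (0u - (ux >> 31)) & uy;   /* y if x < 0, else 0. */
    uint32_t t2 = (0u - (uy >> 31)) & ux;   /* x if y < 0, else 0. */
    return (int32_t)(p - t1 - t2);          /* Arithmetic is mod 2^32. */
}
```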


Unsigned from Signed 


The reverse transformation follows easily. The resulting program is
the same as (1) except with the first instruction changed to multiply
high signed and the last operation changed to p ← p + t1 + t2.


8-4 Multiplication by Constants 


It is nearly a triviality that one can multiply by a constant with a 
sequence of shift left and add instructions. For example, to multiply x 
by 13 (binary 1101), one can code 


t1 ← x << 2
t2 ← x << 3
r ← t1 + t2 + x


where r gets the result.


In this section, left shifts are denoted by multiplication by a power of
2, so the above plan is written r ← 8x + 4x + x, which is intended to
show four instructions on the basic RISC and most machines.


What we want to convey here is that there is more to this subject than 
meets the eye. First of all, there are other considerations besides 
simply the number of shift's and add's required to do a multiplication 
by a given constant. To illustrate, below are two plans for multiplying 
by 45 (binary 101101). 


t ← 4x              t1 ← 4x
r ← x + t           t2 ← 8x
t ← 2t              t3 ← 32x
r ← r + t           r ← t1 + x
t ← 4t              t2 ← t2 + t3
r ← r + t           r ← r + t2


The plan on the left uses a variable t that holds x shifted left by a
number of positions that corresponds to a 1-bit in the multiplier. Each
shifted value is obtained from the one before it. This plan has these
advantages:


• It requires only one working register other than the input x
and the output r.

• Except for the first two, it uses only 2-address instructions.

• The shift amounts are relatively small.


The same properties are retained when the plan is applied to any
multiplier.


The scheme on the right does all the shift's first, with x as the 
operand. It has the advantage of increased parallelism. On a machine 
with sufficient instruction-level parallelism, the scheme on the right 
executes in three cycles, whereas the scheme on the left, running on a 
machine with unlimited parallelism, requires four. 
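

For concreteness, the two plans for multiplying by 45 can be written out in C as below. The function names are ours, and unsigned 32-bit arithmetic is assumed so that only the low-order 32 bits of the product are kept. Each assignment corresponds to one instruction of the plan.

```c
#include <stdint.h>

/* Serial plan: each shifted value is derived from the previous one. */
uint32_t mul45_serial(uint32_t x) {
    uint32_t t, r;
    t = x << 2;          /* t = 4x  */
    r = x + t;           /* r = 5x  */
    t = t << 1;          /* t = 8x  */
    r = r + t;           /* r = 13x */
    t = t << 2;          /* t = 32x */
    r = r + t;           /* r = 45x */
    return r;
}

/* Parallel plan: all shifts take x as the operand, so the three shifts
   are independent, as are the two additions that follow them. */
uint32_t mul45_parallel(uint32_t x) {
    uint32_t t1 = x << 2;        /* 4x  */
    uint32_t t2 = x << 3;        /* 8x  */
    uint32_t t3 = x << 5;        /* 32x */
    uint32_t r  = t1 + x;        /* 5x  */
    t2 = t2 + t3;                /* 40x */
    return r + t2;               /* 45x */
}
```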


In addition to these details, it is nontrivial to find the minimum 
number of operations to accomplish multiplication by a constant, 
where by an "operation" we mean an instruction from a typical 
computer's set of add and shift instructions. In what follows, we 
assume this set consists of add, subtract, shift left by any constant 
amount, and negate. We assume the instruction format is three- 
address. However, the problem is no easier if one is restricted to only 
add (adding a number to itself, and then adding the sum to itself, and 
so on, accomplishes a shift left of any amount), or if one augments the 
set by instructions such as the HP PA-RISC's shift and add instructions. 
(These shift the contents of a register left by one, two, or three 
positions, add it to a second register, and put the result in a third 
register. Thus, it can multiply by 3, 5, or 9 in a single, presumably 
fast, instruction.) We also assume that only the least significant 32 bits 
of the product are wanted. 


The first improvement to the basic binary decomposition scheme 
suggested above is to use subtract to shorten the sequence when the 
multiplier contains a group of three or more consecutive 1-bits. For 
example, to multiply by 28 (binary 11100), we can compute 32x - 4x 
(three instructions) rather than 16x + 8x + 4x (five instructions). On 
two's-complement machines, the result is correct even if the 
intermediate result of 32x overflows and the final result does not. 


To multiply by a constant m with the basic binary decomposition 
scheme (using only shifts and adds) requires 


$$2\,\mathrm{pop}(m) - 1 - \delta$$ 


instructions, where $\delta = 1$ if m ends in a 1-bit (is odd), and $\delta = 0$ 
otherwise. If subtract is also used, it requires 


$$4g(m) + 2s(m) - 1 - \delta$$ 


instructions, where g(m) is the number of groups of two or more 
consecutive 1-bits in m, s(m) is the number of "singleton" 1-bits in m, 
and $\delta$ has the same meaning as before. 


For a group of size 2, it makes no difference which method is used. 
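
The two instruction counts can be evaluated mechanically for any multiplier. The following is a minimal sketch, assuming 32-bit multipliers; the function names cost_add_only and cost_with_subtract are illustrative and do not come from the text, and the parity term is called d in the code.

```c
#include <assert.h>
#include <stdint.h>

/* Instruction count for the basic shift-and-add scheme: 2*pop(m) - 1 - d. */
int cost_add_only(uint32_t m) {
    assert(m != 0);
    int pop = 0;
    for (uint32_t t = m; t != 0; t >>= 1) pop += (int)(t & 1);
    int d = (int)(m & 1);                  /* d = 1 if m is odd */
    return 2 * pop - 1 - d;
}

/* Instruction count when subtract is also used: 4*g(m) + 2*s(m) - 1 - d,
   where g counts runs of two or more 1-bits and s counts isolated 1-bits. */
int cost_with_subtract(uint32_t m) {
    assert(m != 0);
    int g = 0, s = 0;
    int d = (int)(m & 1);
    uint32_t t = m;
    while (t != 0) {
        while ((t & 1) == 0) t >>= 1;      /* skip 0-bits */
        int run = 0;
        while (t & 1) { run++; t >>= 1; }  /* measure one run of 1-bits */
        if (run >= 2) g++; else s++;
    }
    return 4 * g + 2 * s - 1 - d;
}
```

For m = 28 these give five and three instructions, matching the example in the text.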


The next improvement is to treat specially groups that are separated 
by a single 0-bit. For example, consider m = 55 (binary 110111). The 
group method calculates this as (64x - 16x) + (8x - x), which requires 
six instructions. Calculating it as 64x - 8x - x, however, requires only 
four. Similarly, we can multiply by binary 110111011 as illustrated by 
the formula 512x - 64x - 4x - x (six instructions). 


The formulas above give an upper bound on the number of operations 
required to multiply a variable x by any given number m. Another 
bound can be obtained based on the size of m in bits, that is, on 


$$n = \lfloor \log_2 m \rfloor + 1.$$ 


Theorem. Multiplication of a variable x by an n-bit constant m, $m \ge 1$, 
can be accomplished with at most n instructions of the type add, 
subtract, and shift left by any given amount. 


Proof. (Induction on n.) Multiplication by 1 can be done in 0 
instructions, so the theorem holds for n = 1. For n > 1, if m ends in a 
0-bit, then multiplication by m can be accomplished by multiplying by 
the number consisting of the left n - 1 bits of m (that is, by m/2), in n 
- 1 instructions, followed by a shift left of the result by one position. 
This uses n instructions altogether. 


If m ends in binary 01, then mx can be calculated by multiplying x by 
the number consisting of the left n - 2 bits of m, in n - 2 instructions, 
followed by a left shift of the result by 2, and an add of x. This 
requires n instructions altogether. 


If m ends in binary 11, then consider the cases in which it ends in 
0011, 0111, 1011, and 1111. Let t be the result of multiplying x by 
the left n - 4 bits of m. If m ends in 0011, then mx = 16t + 2x + x, 
which requires (n - 4) + 4 = n instructions. If m ends in 0111, then 
mx = 16t + 8x - x, which requires n instructions. If m ends in 1111, 
then mx = 16t + 16x - x, which requires n instructions. The 
remaining case is that m ends in 1011. 


It is easy to show that mx can be calculated in n instructions if m ends 
in 001011, 011011, or 111011. The remaining case is 101011. 


This reasoning can be continued, with the "remaining case" always 
being of the form 101010...10101011. Eventually, the size of m will 
be reached, and the only remaining case is the number 101010... 
10101011. This n-bit number contains n/2 + 1 1-bits. By a previous 
observation, it can multiply x with 2(n/2 + 1) - 2 = n instructions. 


Thus, in particular, on a 32-bit machine multiplication by any 
constant can be done in at most 32 instructions, by the method 
described above. By inspection, it is easily seen that for n even, the n- 
bit number 101010...101011 requires n instructions, and for n odd, 
the n-bit number 1010101...010110 requires n instructions, so the 
bound is tight. 


The methodology described so far is not too hard to work out by hand, 
or to incorporate into an algorithm such as might be used in a 
compiler. But such an algorithm would not always produce the best 
code, because further improvement is sometimes possible. This can 
result from factoring the multiplier m or some intermediate quantity 
along the way of computing mx. For example, consider again m = 45 
(binary 101101). The methods described above require six 
instructions. Factoring 45 as $5 \cdot 9$, however, gives a four-instruction 
solution: 


$t \leftarrow 4x + x$ 
$r \leftarrow 8t + t$ 


Factoring may be combined with additive methods. For example, 
multiplication by 106 (binary 1101010) requires seven instructions by 
the additive methods, but writing it as $7 \cdot 15 + 1$ leads to a 
five-instruction solution. 


With factoring, the maximum number of instructions needed to 
multiply by an n-bit constant is, to the writer's knowledge, an open 
problem. For large n it may be less than the bound of n proved above. 
For example, m = 0xAAAAAAAB requires 32 instructions without 
factoring, but writing this value as $2 \cdot 5 \cdot 17 \cdot 257 \cdot 65537 + 1$ gives a 
ten-instruction solution. (Ten instructions, however, is probably not 
typical of large numbers. The factorization reflects the simple bit 
pattern of alternate 1's and 0's.) 
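
The factored sequences can be written out directly as shifts, adds, and subtracts. This is a sketch assuming 32-bit unsigned arithmetic, where only the low 32 bits of the product are kept; the function names are illustrative, and check_mul_constants is a hypothetical helper that compares each sequence with the compiler's own multiply.

```c
#include <assert.h>
#include <stdint.h>

/* 45x = 9 * (5x): four instructions. */
uint32_t mul45(uint32_t x) {
    uint32_t t = (x << 2) + x;     /* t = 5x */
    return (t << 3) + t;           /* r = 9t */
}

/* 106x = 15 * (7x) + x: five instructions. */
uint32_t mul106(uint32_t x) {
    uint32_t t = (x << 3) - x;     /* t = 7x */
    t = (t << 4) - t;              /* t = 105x */
    return t + x;                  /* r = 106x */
}

/* 0xAAAAAAAB * x = 2*5*17*257*65537*x + x: ten instructions. */
uint32_t mulAAAAAAAB(uint32_t x) {
    uint32_t t = (x << 2) + x;     /* times 5 */
    t = (t << 4) + t;              /* times 17 */
    t = (t << 8) + t;              /* times 257 */
    t = (t << 16) + t;             /* times 65537 */
    t = t << 1;                    /* times 2 */
    return t + x;                  /* plus x */
}

/* Compare each sequence with an ordinary multiplication (mod 2^32). */
void check_mul_constants(uint32_t x) {
    assert(mul45(x) == 45u * x);
    assert(mul106(x) == 106u * x);
    assert(mulAAAAAAAB(x) == 0xAAAAAAABu * x);
}
```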


This should give an idea of the combinatorics involved in this 
seemingly simple problem. Knuth [Knu2, sec. 4.6.3] discusses the 
closely related problem of computing $a^m$ using a minimum number of 
multiplications. This is analogous to the problem of multiplying by m 
using only addition instructions. A compiler algorithm for computing 
mx is described in [Bern]. 


Chapter 9. Integer Division 


Preliminaries 
Multiword Division 
Unsigned Short Division from Signed Division 


Unsigned Long Division 


9-1 Preliminaries 


This chapter and the following one give a number of tricks and 
algorithms involving "computer division" of integers. In mathematical 
formulas we use the expression x/y to denote ordinary rational 
division, $x \div y$ to denote signed computer division of integers 
(truncating toward 0), and $x \overset{u}{\div} y$ to denote unsigned computer 
division of integers. Within C code, x/y of course denotes computer 
division, unsigned if either operand is unsigned, and signed if both 
operands are signed. 


Division is a complex process, and the algorithms involving it are 
often not very elegant. It is even a matter of judgment as to just how 
signed integer division should be defined. Most high-level languages 
and most computer instruction sets define the result to be the rational 
result truncated toward 0. This and two other possibilities are 
illustrated below. 


                 truncating     modulus       floor 
7 ÷ 3        =    2 rem 1        2 rem 1       2 rem 1 
(-7) ÷ 3     =   -2 rem -1      -3 rem 2      -3 rem 2 
7 ÷ (-3)     =   -2 rem 1       -2 rem 1      -3 rem -2 
(-7) ÷ (-3)  =    2 rem -1       3 rem 2       2 rem -1 


The relation dividend = quotient * divisor + remainder holds for all 
three possibilities. We define "modulus" division by requiring that the 
remainder be nonnegative. [1] We define "floor" division by requiring 
that the quotient be the "floor" of the rational result. For positive 
divisors, modulus and floor division are equivalent. A fourth 
possibility, seldom used, rounds the quotient to the nearest integer. 


[1] I know I will be taken to task for this nomenclature, because there 
is no universal agreement that "modulus" implies "nonnegative." 
Knuth's "mod" operator [Knu1] is the remainder of floor division, 
which is negative (or 0) if the divisor is negative. Several 
programming languages use "mod" for the remainder of truncating 
division. However, in mathematics "modulus" is sometimes used for 
the magnitude of a complex number (nonnegative), and in 
congruence theory the modulus is generally assumed to be positive. 


One advantage of modulus and floor division is that most of the tricks 
simplify. For example, division by $2^n$ can be replaced by a shift right 
signed of n positions, and the remainder of dividing x by $2^n$ is given 
by the logical and of x and $2^n - 1$. I suspect that modulus and floor 
division more often give the result you want. For example, suppose 
you are writing a program to graph an integer-valued function, and 
the values range from imin to imax. You want to set up the extremes 
of the ordinate to be the smallest multiples of 10 that include imin 
and imax. Then the extreme values are simply $(imin \div 10) * 10$ and 
$((imax + 9) \div 10) * 10$ if modulus or floor division is used. If 
conventional division is used, you must evaluate something like: 


if (imin >= 0) gmin = (imin/10)*10; 
else gmin = ((imin - 9)/10)*10; 
if (imax >= 0) gmax = ((imax + 9)/10)*10; 
else gmax = (imax/10)*10; 


Besides the quotient being more useful with modulus or floor division 
than with truncating division, we speculate that the nonnegative 
remainder is probably wanted more often than a remainder that can 
be negative. 


It is hard to choose between modulus and floor division, because they 
differ only when the divisor is negative, which is unusual. Appealing 
to existing high-level languages does not help, because they almost 
universally use truncating division for x/y when the operands are 
signed integers. A few give floating-point numbers, or rational 
numbers, for the result. Looking at remainders, there is confusion. In 
Fortran 90, the MOD function gives the remainder of truncating 
division and MODULO gives the remainder of floor division (which can 
be negative). Similarly, in Common Lisp and Ada, REM is the 
remainder of truncating division, and MOD is the remainder of floor 
division. In PL/I, MOD is always nonnegative (it is the remainder of 
modulus division). In Pascal, A mod B is defined only for B > 0, 
and then it is the nonnegative value (the remainder of either modulus 
or floor division). 


Anyway, we cannot change the world even if we knew how we 
wanted to change it, [2] so in what follows we will use the usual 
definition (truncating) for $x \div y$. 


[2] Some do try. IBM's PL.8 language uses modulus division, and 
Knuth's MMIX machine's division instruction uses floor division 
[MMIX]. 


A nice property of truncating division is that it satisfies 


$$(-n) \div d = n \div (-d) = -(n \div d), \quad \text{for } d \ne 0.$$ 


However, care must be exercised when applying this to transform 
programs, because if n or d is the maximum negative number, -n or -d 
cannot be represented in 32 bits. The operation $(-2^{31}) \div (-1)$ is an 
overflow (the result cannot be expressed as a signed quantity in two's- 
complement notation), and on most machines the result is undefined 
or the operation is suppressed. 
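
In C, both division by 0 and the most negative number divided by -1 are undefined for signed operands, so code that cannot rule them out has to test first. The following is a minimal sketch; checked_div is an illustrative name, not a function from the text or from the standard library.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Stores the truncating quotient n / d in *q and returns true when the
   quotient is representable; returns false for d == 0 and INT_MIN / -1. */
bool checked_div(int n, int d, int *q) {
    assert(q != NULL);
    if (d == 0) return false;
    if (n == INT_MIN && d == -1) return false;   /* quotient would overflow */
    *q = n / d;
    return true;
}
```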


Signed integer (truncating) division is related to ordinary rational 
division by 


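Equation 1 


$$n \div d = \begin{cases} \lfloor n/d \rfloor, & \text{if } d \ne 0,\ nd \ge 0, \\ \lceil n/d \rceil, & \text{if } d \ne 0,\ nd < 0. \end{cases}$$ 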


Unsigned integer division—that is, division in which both n and d are 
interpreted as unsigned integers—satisfies the upper portion of (1). 


In the discussion that follows, we make use of the following 
elementary properties of arithmetic, which we don't prove here. See 
[Knu1] and [GKP] for interesting discussions of the floor and ceiling 
functions. 


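Theorem D1. For x real, k an integer, 


$$\begin{array}{ll}
\lfloor x \rfloor = -\lceil -x \rceil & \lceil x \rceil = -\lfloor -x \rfloor \\
x - 1 < \lfloor x \rfloor \le x & x \le \lceil x \rceil < x + 1 \\
\lfloor x \rfloor \le x < \lfloor x \rfloor + 1 & \lceil x \rceil - 1 < x \le \lceil x \rceil \\
x \ge k \Leftrightarrow \lfloor x \rfloor \ge k & x \le k \Leftrightarrow \lceil x \rceil \le k \\
x > k \Rightarrow \lfloor x \rfloor \ge k & x < k \Rightarrow \lceil x \rceil \le k \\
x \le k \Rightarrow \lfloor x \rfloor \le k \Rightarrow x < k + 1 & x \ge k \Rightarrow \lceil x \rceil \ge k \Rightarrow x > k - 1 \\
x < k \Leftrightarrow \lfloor x \rfloor < k & x > k \Leftrightarrow \lceil x \rceil > k
\end{array}$$ 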


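Theorem D2. For n, d integers, d > 0, 


$$\left\lfloor \frac{n}{d} \right\rfloor = \left\lceil \frac{n - d + 1}{d} \right\rceil \quad \text{and} \quad \left\lceil \frac{n}{d} \right\rceil = \left\lfloor \frac{n + d - 1}{d} \right\rfloor.$$ 

If d < 0: 

$$\left\lfloor \frac{n}{d} \right\rfloor = \left\lceil \frac{n - d - 1}{d} \right\rceil \quad \text{and} \quad \left\lceil \frac{n}{d} \right\rceil = \left\lfloor \frac{n + d + 1}{d} \right\rfloor.$$ 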


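Theorem D3. For x real, d an integer > 0: 


$$\lfloor \lfloor x \rfloor / d \rfloor = \lfloor x/d \rfloor \quad \text{and} \quad \lceil \lceil x \rceil / d \rceil = \lceil x/d \rceil.$$ 


Corollary. For a, b real, $b \ne 0$, d an integer > 0, 


$$\left\lfloor \frac{\lfloor a/b \rfloor}{d} \right\rfloor = \left\lfloor \frac{a}{bd} \right\rfloor \quad \text{and} \quad \left\lceil \frac{\lceil a/b \rceil}{d} \right\rceil = \left\lceil \frac{a}{bd} \right\rceil.$$ 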


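Theorem D4. For n, d integers, $d \ne 0$, and x real, 


$$\left\lfloor \frac{n}{d} + x \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \ \text{if } 0 \le x < \left| \frac{1}{d} \right|, \quad \text{and} \quad \left\lceil \frac{n}{d} + x \right\rceil = \left\lceil \frac{n}{d} \right\rceil \ \text{if } -\left| \frac{1}{d} \right| < x \le 0.$$ 


In the theorems below, rem(n, d) denotes the remainder of n divided 
by d. For negative d, it is defined by rem(n, -d) = rem(n, d), as in 
truncating and modulus division. We do not use rem(n, d) with n < 0. 
Thus, for our use, the remainder is always nonnegative. 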

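Theorem D5. For $n \ge 0$, $d \ne 0$, 


$$\mathrm{rem}(2n, d) = 2\,\mathrm{rem}(n, d) \ \text{or} \ 2\,\mathrm{rem}(n, d) - |d|, \quad \text{and}$$ 

$$\mathrm{rem}(2n + 1, d) = 2\,\mathrm{rem}(n, d) + 1 \ \text{or} \ 2\,\mathrm{rem}(n, d) - |d| + 1$$ 


(whichever value is greater than or equal to 0 and less than |d|). 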


Theorem D6. For $n \ge 0$, $d \ne 0$, 

$$\mathrm{rem}(2n, 2d) = 2\,\mathrm{rem}(n, d).$$ 


Theorems D5 and D6 are easily proved from the basic definition of 
remainder—that is, that for some integer q it satisfies 


$$n = qd + \mathrm{rem}(n, d) \quad \text{with} \quad 0 \le \mathrm{rem}(n, d) < |d|,$$ 


provided $n \ge 0$ and $d \ne 0$ (n and d can be non-integers, but we will 
use these theorems only for integers). 
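
Theorem D5 is what lets a remainder be built up one bit of the dividend at a time. The sketch below illustrates this for unsigned 32-bit operands; the names rem_double and rem_bitwise are illustrative, and the divisor is restricted to at most 2^31 so that the doubling cannot overflow 32 bits.

```c
#include <assert.h>

/* rem(2n + bit, d) computed from r = rem(n, d), per Theorem D5. */
unsigned rem_double(unsigned r, unsigned d, unsigned bit) {
    assert(d > 0 && d <= 0x80000000u && r < d && bit <= 1);
    unsigned t = 2 * r + bit;      /* 2rem(n, d) or 2rem(n, d) + 1 */
    if (t >= d) t -= d;            /* otherwise the variant with |d| subtracted */
    return t;
}

/* Remainder of n divided by d, consuming the bits of n from the
   most significant end. */
unsigned rem_bitwise(unsigned n, unsigned d) {
    unsigned r = 0;
    for (int i = 31; i >= 0; i--)
        r = rem_double(r, d, (n >> i) & 1);
    return r;
}
```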


9-2 Multiword Division 


As in the case of multiword multiplication, multiword division may be 
done by, basically, the traditional grade-school method. The details, 
however, are surprisingly complicated. Figure 9-1 is Knuth's Algorithm 
D [Knu2 sec. 4.3.1], coded in C. The underlying form of division it 
uses is $32 \overset{u}{\div} 16 \Rightarrow 32$. (Actually, the quotient of these 
underlying division operations is at most 17 bits long.) 


Figure 9-1 Multiword integer division, unsigned. 


#include <stddef.h>   // NULL
#include <alloca.h>   // alloca (nonstandard; the header name varies by platform)

int nlz(unsigned x);  // Number of leading zeros of a 32-bit word; assumed available.

int divmnu(unsigned short q[], unsigned short r[], 
     const unsigned short u[], const unsigned short v[], 
     int m, int n) { 

   const unsigned b = 65536; // Number base (16 bits). 
   unsigned short *un, *vn;  // Normalized form of u, v. 
   unsigned qhat;            // Estimated quotient digit. 
   unsigned rhat;            // A remainder. 
   unsigned p;               // Product of two digits. 
   int s, i, j, t, k; 

   if (m < n || n <= 0 || v[n-1] == 0) 
      return 1;              // Return if invalid param. 

   if (n == 1) {                        // Take care of 
      k = 0;                            // the case of a 
      for (j = m - 1; j >= 0; j--) {    // single-digit 
         q[j] = (k*b + u[j])/v[0];      // divisor here. 
         k = (k*b + u[j]) - q[j]*v[0];  // Running remainder. 
      } 
      if (r != NULL) r[0] = k; 
      return 0; 
   } 

   // Normalize by shifting v left just enough so that 
   // its high-order bit is on, and shift u left the 
   // same amount. We may have to append a high-order 
   // digit on the dividend; we do that unconditionally. 

   s = nlz(v[n-1]) - 16;        // 0 <= s <= 15. 
   vn = (unsigned short *)alloca(2*n); 
   for (i = n - 1; i > 0; i--) 
      vn[i] = (v[i] << s) | (v[i-1] >> (16 - s)); 
   vn[0] = v[0] << s; 

   un = (unsigned short *)alloca(2*(m + 1)); 
   un[m] = u[m-1] >> (16 - s); 
   for (i = m - 1; i > 0; i--) 
      un[i] = (u[i] << s) | (u[i-1] >> (16 - s)); 
   un[0] = u[0] << s; 

   for (j = m - n; j >= 0; j--) {       // Main loop. 
      // Compute estimate qhat of q[j]. 
      qhat = (un[j+n]*b + un[j+n-1])/vn[n-1]; 
      rhat = (un[j+n]*b + un[j+n-1]) - qhat*vn[n-1]; 
again: 
      if (qhat >= b || qhat*vn[n-2] > b*rhat + un[j+n-2]) 
      { qhat = qhat - 1; 
        rhat = rhat + vn[n-1]; 
        if (rhat < b) goto again; 
      } 

      // Multiply and subtract. The right shifts of t assume 
      // an arithmetic (sign-propagating) shift of negative ints. 
      k = 0; 
      for (i = 0; i < n; i++) { 
         p = qhat*vn[i]; 
         t = un[i+j] - k - (p & 0xFFFF); 
         un[i+j] = t; 
         k = (p >> 16) - (t >> 16); 
      } 
      t = un[j+n] - k; 
      un[j+n] = t; 

      q[j] = qhat;              // Store quotient digit. 
      if (t < 0) {              // If we subtracted too 
         q[j] = q[j] - 1;       // much, add back. 
         k = 0; 
         for (i = 0; i < n; i++) { 
            t = un[i+j] + vn[i] + k; 
            un[i+j] = t; 
            k = t >> 16; 
         } 
         un[j+n] = un[j+n] + k; 
      } 
   } // End j. 

   // If the caller wants the remainder, unnormalize 
   // it and pass it back. 
   if (r != NULL) { 
      for (i = 0; i < n; i++) 
         r[i] = (un[i] >> s) | ((unsigned)un[i+1] << (16 - s)); 
   } 
   return 0; 
} 


The algorithm processes its inputs and outputs a halfword at a time. 
Of course, we would prefer to process a fullword at a time, but it 
seems that such an algorithm would require an instruction that does 
$64 \overset{u}{\div} 32 \Rightarrow 32$ division. We assume here that either the machine 
does not have that instruction or it is hard to access from our 
high-level language. Although we generally assume the machine has 
$32 \overset{u}{\div} 32 \Rightarrow 32$ division, for this problem $32 \overset{u}{\div} 16 \Rightarrow 16$ 
suffices. 


Thus, for this implementation of Knuth's algorithm, the base b is 
65536. See [Knu2] for most of the explanation of this algorithm. 


The dividend u and the divisor v are in "little-endian" order; that is, 
u[0] and v[0] are the least significant digits. (The code works 
correctly on both big- and little-endian machines.) Parameters m and 
n are the number of halfwords in u and v, respectively (Knuth 
defines m to be the length of the quotient). The caller supplies space 
for the quotient q and, optionally, for the remainder r. The space for 
the quotient must be at least m - n + 1 halfwords, and for the 
remainder, n halfwords. Alternatively, a value of NULL can be given 
for the address of the remainder to signify that the remainder is not 
wanted. 


The algorithm requires that the most significant digit of the divisor, 
v[n-1], be nonzero. This simplifies the normalization steps and helps 
to ensure that the caller has allocated sufficient space for the quotient. 
The code checks that v[n-1] is nonzero, and also the requirements 
that $n \ge 1$ and $m \ge n$. If any of these conditions are violated, it 
returns with an error code (return value 1). 


After these checks, the code performs the division for the simple case 
in which the divisor is of length 1. This case is not singled out for 
speed; the rest of the algorithm requires that the divisor be of length 2 
or more. 
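
The single-digit case is ordinary short division carried out digit by digit. The following standalone sketch shows just that case; it assumes 16-bit unsigned short and 32-bit unsigned, and div_by_digit is an illustrative name rather than part of Figure 9-1.

```c
#include <assert.h>

/* Divides the m-halfword number u (least significant digit first) by the
   single nonzero digit v. Stores m quotient digits in q and returns the
   remainder. */
unsigned short div_by_digit(unsigned short q[], const unsigned short u[],
                            int m, unsigned short v) {
    assert(m > 0 && v != 0);
    unsigned k = 0;                          /* running remainder, k < v */
    for (int j = m - 1; j >= 0; j--) {
        unsigned cur = (k << 16) | u[j];     /* k*b + u[j], fits in 32 bits */
        q[j] = (unsigned short)(cur / v);    /* quotient digit is below b */
        k = cur % v;
    }
    return (unsigned short)k;
}
```

Because the running remainder k is always less than v, each partial dividend k*b + u[j] is less than v*b, so every quotient digit fits in a halfword.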


If the divisor is of length 2 or more, the algorithm normalizes the 
divisor by shifting it left just enough so that its high-order bit is 1. The 
dividend is shifted left the same amount, so the quotient is not 
changed by these shifts. As explained by Knuth, these steps are 
necessary to make it easy to guess each quotient digit with good 
accuracy. The number of leading zeros function, nlz(x), is used to 
determine the shift amount. 
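
As an illustration of how the shift amount follows from the leading-zero count, here is a sketch with a deliberately simple bit-testing loop. The names nlz32 and norm_shift are illustrative; the loop is a stand-in chosen for clarity, not the nlz routine the text relies on.

```c
#include <assert.h>

/* Number of leading zeros of a 32-bit word, by testing one bit at a time. */
int nlz32(unsigned x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* Shift that moves the top 1-bit of a nonzero 16-bit digit to bit 15. */
int norm_shift(unsigned short v_hi) {
    assert(v_hi != 0);
    return nlz32(v_hi) - 16;       /* 0 <= s <= 15 */
}
```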


In the normalization steps, new space is allocated for the normalized 
dividend and divisor. This is done because it is generally undesirable, 
from the caller's point of view, to alter these input arguments, and 
because it may be impossible to alter them—they may be constants in 
read-only memory. Furthermore, the dividend may need an additional 
high-order digit. C's "alloca" function is ideal for allocating this space. 
It is usually implemented very efficiently, requiring only two or three 
in-line instructions to allocate the space and no instructions at all to 
free it. The space is allocated on the program's stack, in such a way 
that it is freed automatically upon subroutine return. 


In the main loop, the quotient digits are cranked out one per loop 
iteration, and the dividend is reduced until it becomes the remainder. 
The estimate qhat of each quotient digit, after being refined by the 
steps in the loop labelled again, is always either exact or too high by 1. 


The next steps multiply qhat by the divisor and subtract the product 
from the current remainder, as in the grade school method. If the 
remainder is negative, it is necessary to decrease the quotient digit by 
1 and either re-multiply and subtract or, more simply, adjust the 
remainder by adding the divisor to it. This need be done at most once, 
because the quotient digit was either exact or 1 too high. 


Lastly, the remainder is given back to the caller if the address of 
where to put it is non-null. The remainder must be shifted right by the 
normalization shift amount s. 


The "add back" steps are executed only rarely. To see this, observe that 
the first calculation of each estimated quotient digit qhat is done by 
dividing the most significant two digits of the current remainder by 
the most significant digit of the divisor. The steps in the "again" loop 
amount to refining qhat to be the result of dividing the most 
significant three digits of the current remainder by the most 
significant two digits of the divisor (proof omitted; convince yourself 
of this by trying some examples using b = 10). Note that the divisor 
is greater than or equal to b/2 (because of normalization) and the 
dividend is less than or equal to b times the divisor (because each 
remainder is less than the divisor). 


How accurate is the quotient estimated by using only three dividend 
digits and two divisor digits? Because normalization was done, it can 
be shown to be quite accurate. To see this somewhat intuitively (not a 
formal proof), consider estimating u/v in this way for base ten 
arithmetic. It can be shown that the estimate is always high (or exact). 
Thus, the worst case occurs if truncation of the divisor to two digits 
decreases the divisor by as much as possible in the sense of relative 
error, and truncation of the dividend to three digits increases it by as 
little as possible (which is 0), and if the dividend is as large as 
possible. This occurs for the case 49900...0/5099...9, which we 
estimate by 499/50 = 9.98. The true result is approximately 499/51 
$\approx$ 9.7843. The difference of 0.1957 reveals that the estimated 
quotient digit and the true quotient digit, which are the floor 
functions of these ratios, will differ by at most 1, and this will occur 
about 20% of the time (assuming the quotient digits are uniformly 
distributed). This in turn means that the "add back" steps will be 
executed about 20% of the time. 


Carrying out this (non-rigorous) analysis for a general base b yields 
the result that the estimated and true quotients differ by at most 2/b. 
For b = 65536, we again obtain the result that the difference between 
the estimated and true quotient digits is at most 1, and this occurs 
with probability $2/65536 \approx 0.00003$. Thus the "add back" steps are 
executed for only about 0.003% of the quotient digits. 


An example that requires the add back step is, in decimal, 4500/501. 
A similar example for base 65536 is 
0x7FFF800000000000/0x800000000001. 
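
The decimal example can be traced with a small sketch of the estimate-and-refine step. The function estimate_digit is an illustrative stand-alone version of the logic around the "again" label; its name and parameter list are assumptions made for this example.

```c
#include <assert.h>

/* Refined estimate of one quotient digit in base b, from the top three
   digits (u2 u1 u0) of the current remainder and the top two digits
   (v1 v0) of a normalized divisor. */
unsigned estimate_digit(unsigned u2, unsigned u1, unsigned u0,
                        unsigned v1, unsigned v0, unsigned b) {
    assert(v1 >= b / 2 && v1 < b);           /* divisor is normalized */
    unsigned qhat = (u2 * b + u1) / v1;
    unsigned rhat = (u2 * b + u1) - qhat * v1;
    while (qhat >= b || qhat * v0 > b * rhat + u0) {
        qhat -= 1;
        rhat += v1;
        if (rhat >= b) break;                /* no further correction possible */
    }
    return qhat;
}
```

For 4500/501 the call estimate_digit(4, 5, 0, 5, 0, 10) yields 9, whereas the true quotient is 8 (since 9 * 501 = 4509 exceeds 4500), so the multiply-and-subtract step goes negative and the add-back step is needed.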


We will not attempt to estimate the running time of this entire 
program, but simply note that for large m and n, the execution time is 
dominated by the multiply/subtract loop. On a good compiler this will 
compile into about 16 basic RISC instructions, one of which is 
multiply. The "for j" loop is executed m - n + 1 times, and the 
multiply/subtract loop n times on each of those iterations, giving an 
execution time for this part of the program of 
(15 + mul)n(m - n + 1) cycles, where mul is the 
time to multiply two 16-bit variables. The program also executes m - n 
+ 1 divide instructions and one number of leading zeros instruction. 


Signed Multiword Division 


We do not give an algorithm specifically for signed multiword 
division, but merely point out that the unsigned algorithm can be 
adapted for this purpose as follows: 


1. Negate the dividend if it is negative, and similarly for the 
divisor. 


2. Convert the dividend and divisor to unsigned representation. 
3. Use the unsigned multiword division algorithm. 
4. Convert the quotient and remainder to signed representation. 


5. Negate the quotient if the dividend and divisor had opposite 
signs. 


6. Negate the remainder if the dividend was negative. 


These steps sometimes require adding or deleting a most significant 
digit. For example, assume for simplicity that the numbers are 
represented in base 256 (one byte per digit), and that in the signed 
representation, the high-order bit of the sequence of digits is the sign 
bit. This is much like ordinary two's-complement representation. 
Then, a divisor of 255, which has signed representation 0x00FF, must 
be shortened in step 2 to 0xFF. Similarly, if the quotient from step 3 
begins with a 1-bit, it must be provided with a leading 0-byte for 
correct representation as a signed quantity. 
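
The same six steps can be illustrated on single 32-bit words, where no digits need to be added or deleted. This is only a sketch of the sign handling, not a multiword routine; sdiv_via_udiv is an illustrative name, and the final conversion to int assumes the usual two's-complement wraparound.

```c
#include <assert.h>
#include <limits.h>

/* Signed truncating division built on unsigned division, following the
   steps above for single words. Excludes d == 0 and INT_MIN / -1. */
int sdiv_via_udiv(int n, int d, int *rem) {
    assert(d != 0 && !(n == INT_MIN && d == -1));
    /* Steps 1-2: magnitudes as unsigned values. Negating in unsigned
       arithmetic avoids overflow when an operand is INT_MIN. */
    unsigned un = n < 0 ? 0u - (unsigned)n : (unsigned)n;
    unsigned ud = d < 0 ? 0u - (unsigned)d : (unsigned)d;
    /* Step 3: unsigned division. */
    unsigned uq = un / ud;
    unsigned ur = un % ud;
    /* Steps 4-6: negate as needed, then reinterpret as signed. */
    unsigned sq = ((n < 0) != (d < 0)) ? 0u - uq : uq;
    unsigned sr = (n < 0) ? 0u - ur : ur;
    if (rem) *rem = (int)sr;
    return (int)sq;
}
```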


9-3 Unsigned Short Division from Signed Division 


By "short division" we mean the division of one single word by 
another (e.g., $32 \div 32 \Rightarrow 32$). It is the form of division provided by 
the "/" operator, when the operands are integers, in C and many other 
high-level languages. C has both signed and unsigned short division, 
but some computers provide only signed division in their instruction 
repertoire. How can you implement unsigned division on such a 
machine? There does not seem to be any really slick way to do it, but 
we offer here some possibilities. 


Using Signed Long Division 


Even if the machine has signed long division ($64 \div 32 \Rightarrow 32$), 
unsigned short division is not as simple as you might think. In the XLC 
compiler for the IBM RS/6000, it is implemented as illustrated below 
for $q \leftarrow n \overset{u}{\div} d$. 


if $n \overset{u}{<} d$ then $q \leftarrow 0$ 
else if $d = 1$ then $q \leftarrow n$ 
else if $d \le 1$ then $q \leftarrow 1$ 
else $q \leftarrow (0 \,\|\, n) \div d$ 


The third line is really testing to see if $d \overset{u}{\ge} 2^{31}$. If d is algebraically 
less than or equal to 1 at this point, then because it is not equal to 1 
(from the second line), it must be algebraically less than or equal to 0. 
We don't care about the case d = 0, so for the cases of interest, if the 
test on the third line evaluates to true, the sign bit of d is on, that is, 
$d \overset{u}{\ge} 2^{31}$. Because from the first line it is known that $n \overset{u}{\ge} d$, and 
because n cannot exceed $2^{32} - 1$, $n \overset{u}{\div} d = 1$. 


The notation on the fourth line means to form the double-length 
integer consisting of 32 0-bits followed by the 32-bit quantity n, and 
divide it by d. The test for d = 1 (second line) is necessary to ensure 
that this division does not overflow (it would overflow if $n \overset{u}{\ge} 2^{31}$, 
and then the quotient would be undefined). 


By commoning the comparisons on the second and third lines, [3] the 
above can be implemented in 11 instructions, three of which are 
branches. If it is necessary that the divide be executed when d = 0, to 
get the overflow interrupt, then the third line can be changed to "else 
if $d < 0$ then $q \leftarrow 1$," giving a 12-instruction solution on the 
RS/6000. 
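
The four-line method can be sketched in C by letting a 64-bit signed division stand in for the machine's signed long divide. The function name is illustrative, and the casts to int32_t assume the usual two's-complement reinterpretation of values at or above 2^31.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division using only signed division of a 64-bit
   dividend by a 32-bit divisor. */
uint32_t udiv_via_signed_long(uint32_t n, uint32_t d) {
    assert(d != 0);
    if (n < d) return 0;                    /* line 1: unsigned compare */
    if (d == 1) return n;                   /* line 2: avoids quotient overflow */
    if ((int32_t)d <= 1) return 1;          /* line 3: really d >= 2^31 */
    /* line 4: (0 || n) divided by d; here 2 <= d < 2^31, so the quotient
       is below 2^31 and fits in a signed word. */
    return (uint32_t)((int64_t)n / (int32_t)d);
}
```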


[3] One execution of the RS/6000's compare instruction sets multiple 
status bits indicating less than, greater than, or equal. 


It is a simple matter to alter the above code so that the probable usual 
cases ($2 \le d < 2^{31}$) do not go through so many tests (begin with 
$d \le 1$ ...), but the code volume increases slightly. 


Using Signed Short Division 


If signed long division is not available, but signed short division is, 
then $n \overset{u}{\div} d$ can be implemented by somehow reducing the problem to 
the case $n, d < 2^{31}$, and using the machine's divide instruction. If 
$d \overset{u}{\ge} 2^{31}$, then $n \overset{u}{\div} d$ can only be 0 or 1, so this case is easily 
dispensed with. Then, we can reduce the dividend by using the fact 
that the expression $((n \overset{u}{\gg} 1) \div d) \times 2$ approximates $n \overset{u}{\div} d$ with an 
error of only 0 or 1. This leads to the following method: 


1. if $d < 0$ then if $n \overset{u}{<} d$ then $q \leftarrow 0$ 
2.    else $q \leftarrow 1$ 
3. else do 
4.    $q \leftarrow ((n \overset{u}{\gg} 1) \div d) \times 2$ 
5.    $r \leftarrow n - qd$ 
6.    if $r \overset{u}{\ge} d$ then $q \leftarrow q + 1$ 
7. end 


The test d < 0 on line 1 is really testing to determine if $d \overset{u}{\ge} 2^{31}$. If 
$d \overset{u}{\ge} 2^{31}$, then the largest the quotient could be is $(2^{32} - 1) \div 2^{31} = 1$, 
so the first two lines compute the correct quotient. 


Line 4 represents the code shift right unsigned 1, divide, shift left 1. 
Clearly, $n \overset{u}{\gg} 1 < 2^{31}$, and at this point $d \overset{u}{<} 2^{31}$ as well, so these 
quantities can be used in the computer's signed division instruction. (If 
d = 0, overflow will be signaled here.) 


The estimate computed at line 4 is 


$$q = \lfloor \lfloor n/2 \rfloor / d \rfloor \cdot 2 = \lfloor n/(2d) \rfloor \cdot 2 = \frac{n - \mathrm{rem}(n, 2d)}{d},$$ 


where we have used the corollary of Theorem D3. Line 5 computes the 
remainder corresponding to the estimated quotient. It is 


$$r = n - \frac{n - \mathrm{rem}(n, 2d)}{d} \cdot d = \mathrm{rem}(n, 2d).$$ 


Thus, $0 \le r < 2d$. If r < d, then q is the correct quotient. If $r \ge d$, 
then adding 1 to q gives the correct quotient (the program must use 
an unsigned comparison here because of the possibility that $r \ge 2^{31}$). 
By moving the load immediate of 0 into q ahead of the comparison 
$n \overset{u}{<} d$, and coding the assignment $q \leftarrow 1$ in line 2 as a branch to the 
assignment $q \leftarrow q + 1$ in line 6, this can be coded in 14 instructions 
on most machines, four of which are branches. It is straightforward to 
augment the code to produce the remainder as well: to line 1 append 
$r \leftarrow n$, to line 2 append $r \leftarrow n - d$, and to the "then" clause in line 6 
append $r \leftarrow r - d$. (Or, at the cost of a multiply, simply append 
$r \leftarrow n - qd$ to the end of the whole sequence.) 
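
The seven-line method translates almost line for line into C. In this sketch the casts to int32_t mark where the machine's signed divide would be used; the function name is illustrative, and the test for a negative divisor assumes two's-complement reinterpretation.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division using only signed 32-bit division. */
uint32_t udiv_via_signed_short(uint32_t n, uint32_t d) {
    assert(d != 0);
    if ((int32_t)d < 0)                      /* line 1: d >= 2^31 */
        return n >= d;                       /* lines 1-2: quotient is 0 or 1 */
    /* line 4: both operands are below 2^31, so signed division is safe */
    uint32_t q = (uint32_t)((int32_t)(n >> 1) / (int32_t)d) << 1;
    uint32_t r = n - q * d;                  /* line 5: r = rem(n, 2d) */
    if (r >= d) q = q + 1;                   /* line 6: unsigned compare */
    return q;
}
```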


An alternative for lines 1 and 2 is 

if $n \overset{u}{<} d$ then $q \leftarrow 0$ 
else if $d < 0$ then $q \leftarrow 1$, 


which can be coded a little more compactly, for a total of 13 
instructions, three of which are branches. But it executes more 
instructions in what is probably the usual case (small numbers with n 
> d). 


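Using predicate expressions, the program can be written 


1. if $d < 0$ then $q \leftarrow (n \overset{u}{\ge} d)$ 
2. else do 
3.    $q \leftarrow ((n \overset{u}{\gg} 1) \div d) \times 2$ 
4.    $r \leftarrow n - qd$ 
5.    $q \leftarrow q + (r \overset{u}{\ge} d)$ 
6. end 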


which saves two branches if there is a way to evaluate the predicates 
without branching. On the Compaq Alpha they can be evaluated in 
one instruction (CMPULE); on MIPS they take two (SLTU, XORI). On 
most computers, they can be evaluated in four instructions each (three 
if equipped with a full set of logic instructions), by using the 
expression for $x \overset{u}{\le} y$ given in "Comparison Predicates" on page 21, 
and simplifying because on line 1 of the program above it is known 
that $d_{31} = 1$, and on line 5 it is known that $d_{31} = 0$. The expression 
simplifies to 


$n \overset{u}{\ge} d = (n \mathbin{\&} \neg(n - d)) \overset{u}{\gg} 31$ on line 1, and 

$r \overset{u}{\ge} d = (r \mid \neg(r - d)) \overset{u}{\gg} 31$ on line 5. 


We can get branch-free code by forcing the dividend to be 0 when 
$d \overset{u}{\ge} 2^{31}$. Then, the divisor can be used in the machine's signed divide 
instruction, because when it is misinterpreted as a negative number, 
the result is set to 0, which is within 1 of being correct. We'll still 
handle the case of a large dividend by shifting it one position to the 
right before the division, and then shifting the quotient one position to 
the left after the division. This gives the following program (ten basic 
RISC instructions): 


1. $t \leftarrow d \overset{s}{\gg} 31$ 
2. $n' \leftarrow n \mathbin{\&} \neg t$ 
3. $q \leftarrow ((n' \overset{u}{\gg} 1) \div d) \times 2$ 
4. $r \leftarrow n - qd$ 
5. $q \leftarrow q + (r \overset{u}{\ge} d)$ 
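
A C rendering of the five-step program might look as follows. It is a sketch under stated assumptions: the mask t is formed with an unsigned negate rather than a signed right shift (whose result for negative values is implementation-defined in C), the function name is illustrative, and whether the final comparison compiles without a branch depends on the compiler and target.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division from signed division, without explicit branches. */
uint32_t udiv_branch_free(uint32_t n, uint32_t d) {
    assert(d != 0);
    uint32_t t  = 0u - (d >> 31);            /* all 1's if d >= 2^31, else 0 */
    uint32_t n1 = n & ~t;                    /* dividend forced to 0 for large d */
    uint32_t q  = (uint32_t)((int32_t)(n1 >> 1) / (int32_t)d) << 1;
    uint32_t r  = n - q * d;                 /* remainder for the estimate */
    return q + (r >= d);                     /* predicate adds 0 or 1 */
}
```

When d is at least 2^31 the estimate q is 0, r equals n, and the final predicate alone supplies the quotient of 0 or 1.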


9-4 Unsigned Long Division 


By "long division" we mean the division of a doubleword by a single 
word. For a 32-bit machine, this is $64 \overset{u}{\div} 32 \Rightarrow 32$ division, with 
the result unspecified in the overflow cases, including division by 0. 


Some 32-bit machines provide an instruction for unsigned long 
division. Its full capability, however, gets little use, because only 
$32 \overset{u}{\div} 32 \Rightarrow 32$ division is accessible with most high-level 
languages. Therefore, a computer designer might elect to provide only 
$32 \overset{u}{\div} 32 \Rightarrow 32$ division, and would probably want an estimate of the 
execution time of a subroutine that implements the missing function. 


Here we give two algorithms for providing this missing function. 


Hardware Shift-and-Subtract Algorithms 


As a first attempt at doing long division, we consider doing what the 
hardware does. There are two algorithms commonly used, called 
restoring and nonrestoring division [H&P, sec. A-2; EL]. They are both 
basically "shift-and-subtract" algorithms. In the restoring version, 
shown below, the restoring step consists of adding back the divisor 
when the subtraction gives a negative result. Here x, y, and z are held 
in 32-bit registers. Initially, the double-length dividend is x || y, and 
the divisor is z. We need a single-bit register c to hold the overflow 
from the subtraction. 


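do $i \leftarrow 1$ to 32 
   $c \,\|\, x \,\|\, y \leftarrow 2(x \,\|\, y)$   // Shift left one. 
   $c \,\|\, x \leftarrow (c \,\|\, x) - (\mathtt{0b0} \,\|\, z)$   // Subtract (33 bits). 
   $y_0 \leftarrow \neg c$   // Set one bit of quotient. 
   if $c$ then $c \,\|\, x \leftarrow (c \,\|\, x) + (\mathtt{0b0} \,\|\, z)$   // Restore. 
end 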


Upon completion, the quotient is in register y and the remainder is in 
register x. 
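
The restoring algorithm can be simulated in C by holding the 33-bit quantity c || x in a 64-bit variable. This sketch is meant only to mirror the steps of the hardware loop; the function name and the rem output parameter are illustrative, and it requires x < z so that the quotient fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring division of the doubleword x || y by z. Returns the quotient
   and stores the remainder in *rem if rem is non-null. */
uint32_t div_restoring(uint32_t x, uint32_t y, uint32_t z, uint32_t *rem) {
    const uint64_t mask33 = (1ULL << 33) - 1;
    assert(x < z);                                     /* no overflow */
    for (int i = 1; i <= 32; i++) {
        uint64_t cx = ((uint64_t)x << 1) | (y >> 31);  /* shift c||x||y left one */
        y <<= 1;
        cx = (cx - z) & mask33;                        /* subtract (33 bits) */
        uint32_t c = (uint32_t)(cx >> 32);             /* c = 1 if result negative */
        y |= c ^ 1;                                    /* set one bit of quotient */
        if (c) cx = (cx + z) & mask33;                 /* restore */
        x = (uint32_t)cx;
    }
    if (rem) *rem = x;
    return y;
}
```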


The algorithm does not give a useful result in the overflow cases. For 
division of the doubleword quantity x || y by 0, the quotient obtained 
is the one's-complement of x, and the remainder obtained is y. In 
particular, $0 \overset{u}{\div} 0 = 2^{32} - 1$ rem 0. The other overflow cases are 
difficult to characterize. 


It might be useful if, for nonzero divisors, the algorithm would give 
the correct quotient modulo $2^{32}$, and the correct remainder. However, 
the only way to do this seems to be to make the register represented 
by c || x || y above 97 bits long, and do the loop 64 times. This is 
doing $64 \overset{u}{\div} 32 \Rightarrow 64$ division. The subtractions would still be 
33-bit operations, but the additional hardware and execution time make 
this refinement probably not worthwhile. 


This algorithm is difficult to implement exactly in software, because 
most machines do not have the 33-bit register that we have 
represented by c || x. Figure 9-2, however, illustrates a shift-and- 
subtract algorithm that reflects the hardware algorithm to some 
extent. 


The variable t is used for a device to make the comparison come out 
right. We want to do a 33-bit comparison after shifting x || y. If 
the first bit of x is 1 (before the shift), then certainly the 33-bit 
quantity is greater than the divisor (32 bits). In this case, x | t is all 
1's, so the comparison gives the correct result (true). On the other 
hand, if the first bit of x is 0, then a 32-bit comparison is sufficient. 


The code of the algorithm in Figure 9-2 executes in 321 to 385 basic 
RISC instructions, depending upon how often the comparison is true. 
If the machine has shift left double, the shifting operation can be done 
in one instruction, rather than the four used above. This would reduce 
the execution time to about 225 to 289 instructions (we are allowing 
two instructions per iteration for loop control). 


Figure 9-2 Divide long unsigned, shift-and-subtract algorithm. 

unsigned divlu(unsigned x, unsigned y, unsigned z) {
   // Divides (x || y) by z.
   int i;
   unsigned t;

   for (i = 1; i <= 32; i++) {
      t = (int)x >> 31;          // All 1's if x(31) = 1.
      x = (x << 1) | (y >> 31);  // Shift x || y left
      y = y << 1;                // one bit.
      if ((x | t) >= z) {
         x = x - z;
         y = y + 1;
      }
   }
   return y;                     // Remainder is x.
}


The algorithm in Figure 9-2 can be used to do 32 ÷ 32 ⇒ 32 
division by supplying x = 0. The only simplification that results is 
that the variable t can be omitted, as its value would always be 0. 
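A minimal sketch of that simplification follows, assuming 32-bit unsigned operands and a nonzero divisor. The name divu32 and the rem parameter are illustrative, not from the text. Before each shift the partial remainder x is less than $2^{31}$, which is why t is unnecessary.

```c
#include <stdint.h>

/* 32 / 32 => 32 shift-and-subtract division, the x = 0 special case
   of the doubleword algorithm. Assumes z != 0. Returns the quotient
   and stores the remainder through rem. */
uint32_t divu32(uint32_t y, uint32_t z, uint32_t *rem) {
    uint32_t x = 0;
    for (int i = 1; i <= 32; i++) {
        x = (x << 1) | (y >> 31);   /* Shift x || y left one bit. */
        y <<= 1;
        if (x >= z) {               /* Plain 32-bit comparison suffices. */
            x -= z;
            y += 1;
        }
    }
    *rem = x;
    return y;
}
```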


Below is the nonrestoring hardware division algorithm (unsigned). 
The basic idea is that, after subtracting the divisor z from the 33-bit 
quantity that we denote by c || x, there is no need to add back z if the 
result was negative. Instead, it suffices to add on the next iteration, 
rather than subtract. This is because adding z (to correct the error of 
having subtracted z on the previous iteration), shifting left, and 
subtracting z is equivalent to adding z (2(u + z)-z = 2u + z). The 
advantage to hardware is that there is only one add or subtract 
operation on each loop iteration, and the adder is likely to be the 
slowest circuit in the loop. [4] An adjustment to the remainder is 
needed at the end, if it is negative. (No corresponding adjustment of 
the quotient is required.) 


[4] Actually, the restoring division algorithm can avoid the restoring 
step by putting the result of the subtraction in an additional register, 
and writing that register into x only if the result of the subtraction (33 
bits) is nonnegative. But in some implementations this may require an 
additional register and possibly more time. 


The input dividend is the doubleword quantity x || y, and the divisor 
is z. Upon completion, the quotient is in register y and the remainder 
is in register x. 


c = 0
do i ← 1 to 32
   if c = 0 then do
      c || x || y ← 2(x || y)          // Shift left one.
      c || x ← (c || x) - (0b0 || z)   // Subtract divisor.
   end
   else do
      c || x || y ← 2(x || y)          // Shift left one.
      c || x ← (c || x) + (0b0 || z)   // Add divisor.
   end
   y_0 ← ¬c                            // Set one bit of quotient.
end
if c = 1 then x ← x + z                // Adjust remainder if negative.


This does not seem to adapt very well to a 32-bit algorithm. 
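The loop can nevertheless be simulated in software if a wider integer is available to stand in for the 33-bit register. The sketch below does that with a 64-bit integer; it is not a 32-bit algorithm, the name nonrestoring_divide and the rem parameter are illustrative, and x < z is assumed.

```c
#include <stdint.h>

/* Nonrestoring division of x || y by z, assuming x < z.
   The low 33 bits of cx model the register c || x. */
uint32_t nonrestoring_divide(uint32_t x, uint32_t y, uint32_t z, uint32_t *rem) {
    const uint64_t MASK33 = 0x1FFFFFFFFULL;
    uint32_t c = 0;
    for (int i = 1; i <= 32; i++) {
        /* Shift x || y left one bit (the old c is discarded). */
        uint64_t cx = ((uint64_t)x << 1) | (y >> 31);
        y <<= 1;
        if (c == 0) cx = (cx - z) & MASK33;   /* Subtract divisor. */
        else        cx = (cx + z) & MASK33;   /* Add divisor. */
        c = (uint32_t)(cx >> 32);
        x = (uint32_t)cx;
        y |= c ^ 1u;                          /* Quotient bit is not-c. */
    }
    if (c == 1) x += z;     /* Adjust remainder if negative. */
    *rem = x;
    return y;
}
```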


The 801 minicomputer (an early experimental RISC machine built by 
IBM) had a divide step instruction that essentially performed the steps 
in the body of the loop above. It used the machine's carry status bit to 
hold c, and the MQ (a 32-bit register) to hold y. A 33-bit adder/ 
subtracter is needed for its implementation. The 801's divide step 
instruction was a little more complicated than the loop above, because 
it performed signed division and it had an overflow check. Using it, a 
division subroutine can be written that consists essentially of 32 
consecutive divide step instructions followed by some adjustments to 
the quotient and remainder to make the remainder have the desired 
sign. 


Using Short Division 


An algorithm for 64 ÷ 32 ⇒ 32 division can be obtained from the 
multiword division algorithm of Figure 9-1 on page 141, by 
specializing it to the case m = 4, n = 2. Several other changes are 
necessary. The parameters should be fullwords passed by value, rather 
than arrays of halfwords. The overflow condition is different; it occurs 
if the quotient cannot be contained in a single fullword. It turns out 
that many simplifications to the routine are possible. It can be shown 
that the guess qhat is always exact; it is exact if the divisor consists 
of only two halfword digits. This means that the "add back" steps can 
be omitted. If the "main loop" of Figure 9-1 and the loop within it are 
unrolled, some minor simplifications become possible. 


The result of these transformations is shown in Figure 9-3. The 
dividend is in u1 and u0, with u1 containing the most significant 
word. The divisor is parameter v. The quotient is the returned value 
of the function. If the caller provides a non-null pointer in parameter 
r, the function will return the remainder in the word to which r 
points. 


For an overflow indication, the program returns a remainder equal to 
the maximum unsigned integer. This is an impossible remainder for a 
valid division operation, because the remainder must be less than the 
divisor. In the overflow case, the program also returns a quotient 
equal to the maximum unsigned integer, which may be an adequate 
indicator in some cases in which the remainder is not wanted. 


The strange expression (-s >> 31) in the assignment to un32 is 
supplied to make the program work for the case s = 0 on machines 
that have mod 32 shifts (e.g., Intel x86). 


Experimentation with uniformly distributed random numbers suggests 
that the bodies of the "again" loops are each executed about 0.38 times 
for each execution of the function. This gives an execution time, if the 
remainder is not wanted, of about 52 instructions. Of these 
instructions, one is number of leading zeros, two are divide, and 6.5 
are multiply (not counting the multiplications by b, which are shifts). 
If the remainder is wanted, add six instructions (counting the store of 
r), one of which is multiply. 


What about a signed version of divlu? It would probably be difficult 
to modify the code of Figure 9-3, step by step, to produce a signed 
variant. That algorithm, however, may be used for signed division by 
taking the absolute value of the arguments, running divlu, and then 
complementing the result if the signs of the original arguments differ. 
There is no problem with extreme values such as the maximum 
negative number, because the absolute value of any signed integer has 
a correct representation as an unsigned integer. This algorithm is 
shown in Figure 9-4. 


Figure 9-3 Divide long unsigned, using fullword division instruction. 


unsigned divlu(unsigned u1, unsigned u0, unsigned v,
               unsigned *r) {
   const unsigned b = 65536; // Number base (16 bits).
   unsigned un1, un0,        // Norm. dividend LSD's.
            vn1, vn0,        // Norm. divisor digits.
            q1, q0,          // Quotient digits.
            un32, un21, un10,// Dividend digit pairs.
            rhat;            // A remainder.
   int s;                    // Shift amount for norm.

   if (u1 >= v) {            // If overflow, set rem.
      if (r != NULL)         // to an impossible value,
         *r = 0xFFFFFFFF;    // and return the largest
      return 0xFFFFFFFF;}    // possible quotient.

   s = nlz(v);               // 0 <= s <= 31.
   v = v << s;               // Normalize divisor.
   vn1 = v >> 16;            // Break divisor up into
   vn0 = v & 0xFFFF;         // two 16-bit digits.

   un32 = (u1 << s) | ((u0 >> (32 - s)) & (-s >> 31));
   un10 = u0 << s;           // Shift dividend left.

   un1 = un10 >> 16;         // Break right half of
   un0 = un10 & 0xFFFF;      // dividend into two digits.

   q1 = un32/vn1;            // Compute the first
   rhat = un32 - q1*vn1;     // quotient digit, q1.
again1:
   if (q1 >= b || q1*vn0 > b*rhat + un1) {
     q1 = q1 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again1;}

   un21 = un32*b + un1 - q1*v;  // Multiply and subtract.

   q0 = un21/vn1;            // Compute the second
   rhat = un21 - q0*vn1;     // quotient digit, q0.
again2:
   if (q0 >= b || q0*vn0 > b*rhat + un0) {
     q0 = q0 - 1;
     rhat = rhat + vn1;
     if (rhat < b) goto again2;}

   if (r != NULL)            // If remainder is wanted,
      *r = (un21*b + un0 - q0*v) >> s;     // return it.
   return q1*b + q0;
}


Figure 9-4 Divide long signed, using divide long unsigned. 


int divls(int u1, unsigned u0, int v, int *r) {
   int q, uneg, vneg, diff, borrow;

   uneg = u1 >> 31;          // -1 if u < 0.
   if (uneg) {               // Compute the absolute
      u0 = -u0;              // value of the dividend u.
      borrow = (u0 != 0);
      u1 = -u1 - borrow;}

   vneg = v >> 31;           // -1 if v < 0.
   v = (v ^ vneg) - vneg;    // Absolute value of v.

   if ((unsigned)u1 >= (unsigned)v) goto overflow;

   q = divlu(u1, u0, v, (unsigned *)r);

   diff = uneg ^ vneg;       // Negate q if signs of
   q = (q ^ diff) - diff;    // u and v differed.
   if (uneg && r != NULL)
      *r = -*r;

   if ((diff ^ q) < 0 && q != 0) {    // If overflow,
overflow:                             // set remainder
      if (r != NULL)                  // to an impossible value,
         *r = 0x80000000;             // and return the largest
      q = 0x80000000;}                // possible neg. quotient.
   return q;
}


It is hard to devise really good code to detect overflow in the signed 
case. The algorithm shown in Figure 9-4 makes a preliminary 
determination identical to that used by the unsigned long division 
routine, which ensures that $|u/v| < 2^{32}$. After that, it is necessary only 
to ensure that the quotient has the proper sign or is 0. 


Chapter 10. Integer Division by Constants 


On many computers, division is very time consuming and is to be 
avoided when possible. A value of 20 or more elementary add times is 
not uncommon, and the execution time is usually the same large value 
even when the operands are small. This chapter gives some methods 
for avoiding the divide instruction when the divisor is a constant. 


10-1 Signed Division by a Known Power of 2 


Apparently, many people have made the mistake of assuming that a 
shift right signed of k positions divides a number by $2^k$, using the 
usual truncating form of division [GLS2]. It's a little more complicated 
than that. The code shown below computes $q = n \div 2^k$, for 
$1 \le k \le 31$ [Hop]. 


shrsi t,n,k-1       Form the integer
shri  t,t,32-k      2**k - 1 if n < 0, else 0.
add   t,n,t         Add it to n,
shrsi q,t,k         and shift right (signed).


It is branch-free. It also simplifies to three instructions in the common 
case of division by 2 (k = 1). It does, however, rely on the machine's 
being able to shift by a large amount in a short time. The case k = 31 
does not make too much sense, because the number $2^{31}$ is not 
representable in the machine. Nevertheless, the code does produce the 
correct result in that case (which is q = -1 if $n = -2^{31}$ and q = 0 for 
all other n). 
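The four-instruction sequence translates directly to C. This is a sketch only: the name div_pow2 is illustrative, and it assumes that `>>` applied to a negative int32_t is an arithmetic (sign-propagating) shift, which the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Truncating signed division of n by 2**k, for 1 <= k <= 31.
   Assumes >> on a negative int32_t is an arithmetic shift. */
int32_t div_pow2(int32_t n, int k) {
    int32_t t = n >> (k - 1);                  /* shrsi t,n,k-1           */
    t = (int32_t)((uint32_t)t >> (32 - k));    /* shri: 2**k - 1 if n < 0 */
    t = n + t;                                 /* add  t,n,t              */
    return t >> k;                             /* shrsi q,t,k             */
}
```

The addition cannot overflow, because t is nonzero only when n is negative.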


To divide by $-2^k$, the above code may be followed by a negate 
instruction. There does not seem to be any better way to do it. 


The more straightforward code for dividing by $2^k$ is 


bge n,label Branch if n >= 0. 
addi n,n,2**k-1 Add 2**k - 1 to n, 
label shrsi n,n,k and shift right (signed). 


This would be preferable on a machine with slow shifts and fast 
branches. 


PowerPC has an unusual device for speeding up division by a power of 
2 [GGS]. The shift right signed instructions set the machine's carry bit 
if the number being shifted is negative and one or more 1-bits are 
shifted out. That machine also has an instruction for adding the carry 
bit to a register, denoted addze. This allows division by any 
(positive) power of 2 to be done in two instructions: 


shrsi q,n,k 
addze q,q 


A single shrsi of k positions does a kind of signed division by $2^k$ 
that coincides with both modulus and floor division. This suggests that 
one of these might be preferable to truncating division for computers 
and HLL's to use. That is, modulus and floor division mesh with 
shrsi better than does truncating division, permitting a compiler to 
translate the expression n/2 to an shrsi. Furthermore, shrsi 
followed by neg (negate) does modulus division by $-2^k$, which is a 
hint that maybe modulus division is best. (However, this is mainly an 
aesthetic issue. It is of little practical significance because division by 
a negative constant is no doubt extremely rare.) 


10-2 Signed Remainder from Division by a Known Power 
of 2 


If both the quotient and remainder of $n \div 2^k$ are wanted, it is simplest 
to compute the remainder r from $r = n - q \cdot 2^k$. This requires only 
two instructions after computing the quotient q: 


shli  r,q,k 
sub   r,n,r 


To compute only the remainder seems to require about four or five 
instructions. One way to compute it is to use the four-instruction 
sequence above for signed division by $2^k$, followed by the two 
instructions shown immediately above to obtain the remainder. This 
results in two consecutive shift instructions that can be replaced by an 
and, giving a solution in five instructions (four if k = 1): 


shrsi t,n,k-1       Form the integer
shri  t,t,32-k      2**k - 1 if n < 0, else 0,
add   t,n,t         add it to n,
andi  t,t,-2**k     clear rightmost k bits,
sub   r,n,t         and subtract it from n.


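Another method is based on 

$$\operatorname{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ -((-n) \mathbin{\&} (2^k - 1)), & n < 0. \end{cases}$$

To use this, first compute $t \leftarrow n \overset{s}{\gg} 31$ (a signed 
shift right), and then 

$$r \leftarrow ((\operatorname{abs}(n) \mathbin{\&} (2^k - 1)) \oplus t) - t$$

(five instructions) or, for k = 1, since $(-n) \mathbin{\&} 1 = n \mathbin{\&} 1$, 

$$r \leftarrow ((n \mathbin{\&} 1) \oplus t) - t$$

(four instructions). This method is not very good for k > 1 if the 
machine does not have absolute value (computing the remainder 
would then require seven instructions). 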


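Still another method is based on 

$$\operatorname{rem}(n, 2^k) = \begin{cases} n \mathbin{\&} (2^k - 1), & n \ge 0, \\ ((n + 2^k - 1) \mathbin{\&} (2^k - 1)) - (2^k - 1), & n < 0. \end{cases}$$

This leads to (with $\overset{s}{\gg}$ a signed and $\overset{u}{\gg}$ an 
unsigned shift right) 

$$t \leftarrow (n \overset{s}{\gg} (k - 1)) \overset{u}{\gg} (32 - k)$$
$$r \leftarrow ((n + t) \mathbin{\&} (2^k - 1)) - t$$

(five instructions for k > 1, four for k = 1). 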


The above methods all work for $1 \le k \le 31$. 
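As an illustration, the last of these methods might be written in C as follows. The name rem_pow2 is illustrative, and the sketch assumes that `>>` on a negative int32_t is an arithmetic shift (implementation-defined in C).

```c
#include <stdint.h>

/* Signed remainder of n divided by 2**k (truncating division),
   for 1 <= k <= 31. Assumes arithmetic >> on negative int32_t. */
int32_t rem_pow2(int32_t n, int k) {
    uint32_t mask = ((uint32_t)1 << k) - 1;               /* 2**k - 1 */
    /* t = 2**k - 1 if n < 0, else 0. */
    uint32_t t = (uint32_t)(n >> (k - 1)) >> (32 - k);
    /* r = ((n + t) & (2**k - 1)) - t; both operands are below 2**31. */
    return (int32_t)(((uint32_t)n + t) & mask) - (int32_t)t;
}
```

The result has the sign of the dividend, matching C's `%` operator for a positive divisor.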


Incidentally, if shift right signed is not available, the value that is 
$2^k - 1$ for n < 0 and 0 for $n \ge 0$ can be constructed from 

$$t_1 \leftarrow n \overset{u}{\gg} 31$$
$$t_2 \leftarrow (t_1 \ll k) - t_1,$$

which adds only one instruction. 


10-3 Signed Division and Remainder by Non-Powers of 2 


The basic trick is to multiply by a sort of reciprocal of the divisor d, 
approximately $2^{32}/d$, and then to extract the leftmost 32 bits of the 
product. The details, however, are more complicated, particularly for 
certain divisors such as 7. 


Let us first consider a few specific examples. These illustrate the code 
that will be generated by the general method. We denote registers as 
follows: 


n - the input integer (numerator) 
M - loaded with a "magic number" 
t - a temporary register 

q - will contain the quotient 


r - will contain the remainder 


Divide by 3 

li    M,0x55555556  Load magic number, (2**32+2)/3.
mulhs q,M,n         q = floor(M*n/2**32).
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.
muli  t,q,3         Compute remainder from
sub   r,n,t         r = n - q*3.


Proof. The multiply high signed operation (mulhs) cannot overflow, 
as the product of two 32-bit integers can always be represented in 64 
bits and mulhs gives the high-order 32 bits of the 64-bit product. 
This is equivalent to dividing the 64-bit product by $2^{32}$ and taking the 
floor of the result, and this is true whether the product is positive or 
negative. Thus, for $n \ge 0$ the above code computes 

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{2n}{3 \cdot 2^{32}} \right\rfloor.$$

Now, $n < 2^{31}$, because $2^{31} - 1$ is the largest representable positive 
number. Hence the "error" term $2n/(3 \cdot 2^{32})$ is less than 1/3 (and is 
nonnegative), so by Theorem D4 (page 139) we have $q = \lfloor n/3 \rfloor$, 
which is the desired result (Equation (1) on page 138). 


For n < 0, there is an addition of 1 to the quotient. Hence the code 
computes 

$$q = \left\lfloor \frac{2^{32} + 2}{3} \cdot \frac{n}{2^{32}} \right\rfloor + 1 = \left\lfloor \frac{2^{32}n + 2n + 3 \cdot 2^{32}}{3 \cdot 2^{32}} \right\rfloor = \left\lceil \frac{2^{32}n + 2n + 1}{3 \cdot 2^{32}} \right\rceil,$$

where we have used Theorem D2. Hence 

$$q = \left\lceil \frac{n}{3} + \frac{2n + 1}{3 \cdot 2^{32}} \right\rceil.$$

For $-2^{31} \le n \le -1$, 

$$-\frac{1}{3} + \frac{1}{3 \cdot 2^{32}} \le \frac{2n + 1}{3 \cdot 2^{32}} \le -\frac{1}{3 \cdot 2^{32}}.$$

The error term is nonpositive and greater than -1/3, so by Theorem 
D4 $q = \lceil n/3 \rceil$, which is the desired result (Equation (1) on page 
138). 


This establishes that the quotient is correct. That the remainder is 
correct follows easily from the fact that the remainder must satisfy 

$$n = qd + r,$$

the multiplication by 3 cannot overflow (because 
$-2^{31}/3 \le q \le (2^{31} - 1)/3$), and the subtract cannot overflow 
because the result must be in the range -2 to +2. 


The multiply immediate can be done with two add's, or a shift and an 
add, if either gives an improvement in execution time. 


On many present-day RISC computers, the quotient can be computed 
as shown above in nine or ten cycles, whereas the divide instruction 
might take 20 cycles or so. 
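The divide-by-3 sequence can be sketched in C, with a 64-bit product standing in for mulhs. The function name divrem3 is illustrative, and the sketch assumes that `>>` on a negative int64_t is an arithmetic shift (implementation-defined in C).

```c
#include <stdint.h>

/* Signed division of n by 3 using the magic number (2**32 + 2)/3.
   Returns the quotient and stores the remainder through r.
   Assumes arithmetic >> on negative int64_t values. */
int32_t divrem3(int32_t n, int32_t *r) {
    /* mulhs: high-order 32 bits of the 64-bit product M*n. */
    int32_t q = (int32_t)(((int64_t)0x55555556 * n) >> 32);
    q += (int32_t)((uint32_t)n >> 31);   /* Add 1 to q if n is negative. */
    *r = n - q * 3;                      /* Remainder from r = n - q*3.  */
    return q;
}
```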


Divide by 5 


For division by 5, we would like to use the same code as for division 
by 3, except with a multiplier of $(2^{32} + 4)/5$. Unfortunately, the error 
term is then too large; the result is off by 1 for about 1/5 of the values 
of $n \ge 2^{30}$ in magnitude. However, we can use a multiplier of 
$(2^{33} + 3)/5$ and add a shift right signed instruction. The code is 


li    M,0x66666667  Load magic number, (2**33+3)/5.
mulhs q,M,n         q = floor(M*n/2**32).
shrsi q,q,1
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.
muli  t,q,5         Compute remainder from
sub   r,n,t         r = n - q*5.


Proof. The mulhs produces the leftmost 32 bits of the 64-bit product, 
and then the code shifts this right by one position, signed (or 
"arithmetically"). This is equivalent to dividing the product by $2^{33}$ and 
then taking the floor of the result. Thus, for $n \ge 0$ the code computes 

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{5} + \frac{3n}{5 \cdot 2^{33}} \right\rfloor.$$

For $0 \le n < 2^{31}$, the error term $3n/(5 \cdot 2^{33})$ is nonnegative and less 
than 1/5, so by Theorem D4, $q = \lfloor n/5 \rfloor$. 


For n < 0 the above code computes 

$$q = \left\lfloor \frac{2^{33} + 3}{5} \cdot \frac{n}{2^{33}} \right\rfloor + 1 = \left\lceil \frac{n}{5} + \frac{3n + 1}{5 \cdot 2^{33}} \right\rceil.$$

The error term is nonpositive and greater than -1/5, so 
$q = \lceil n/5 \rceil$. 

That the remainder is correct follows as in the case of division by 3. 


The multiply immediate can be done with a shift left of two and an 
add. 
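In C the divide-by-5 sequence differs from divide-by-3 only in the magic number and the extra signed shift. As before, this is a sketch: the name div5 is illustrative and `>>` on negative signed values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Signed division of n by 5 using the magic number (2**33 + 3)/5.
   Assumes arithmetic >> on negative signed values. */
int32_t div5(int32_t n) {
    int32_t q = (int32_t)(((int64_t)0x66666667 * n) >> 32);  /* mulhs   */
    q >>= 1;                                                 /* shrsi 1 */
    q += (int32_t)((uint32_t)n >> 31);   /* Add 1 to q if n is negative. */
    return q;
}
```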


Divide by 7 


Dividing by 7 creates a new problem. Multipliers of $(2^{32} + 3)/7$ and 
$(2^{33} + 6)/7$ give error terms that are too large. A multiplier of 
$(2^{34} + 5)/7$ would work, but it's too large to represent in a 32-bit 
signed word. We can multiply by this large number by multiplying by 
$(2^{34} + 5)/7 - 2^{32}$ (a negative number), and then correcting the 
product by inserting an add. The code is 


li    M,0x92492493  Magic num, (2**34+5)/7 - 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
add   q,q,n         q = floor(M*n/2**32) + n.
shrsi q,q,2         q = floor(q/4).
shri  t,n,31        Add 1 to q if
add   q,q,t         n is negative.
muli  t,q,7         Compute remainder from
sub   r,n,t         r = n - q*7.


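Proof. It is important to note that the instruction "add q,q,n" above 
cannot overflow. This is because q and n have opposite signs, due to 
the multiplication by a negative number. Therefore, this "computer 
arithmetic" addition is the same as real number addition. Hence for 
$n \ge 0$ the above code computes 

$$q = \left\lfloor \frac{\left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n}{4} \right\rfloor = \left\lfloor \frac{\left\lfloor \frac{2^{34}n + 5n - 7 \cdot 2^{32}n}{7 \cdot 2^{32}} + n \right\rfloor}{4} \right\rfloor = \left\lfloor \frac{n}{7} + \frac{5n}{7 \cdot 2^{34}} \right\rfloor,$$

where we have used the corollary of Theorem D3. 


For $0 \le n < 2^{31}$, the error term $5n/(7 \cdot 2^{34})$ is nonnegative and less 
than 1/7, so $q = \lfloor n/7 \rfloor$. 


For n < 0, the above code computes 

$$q = \left\lfloor \frac{\left\lfloor \left( \frac{2^{34} + 5}{7} - 2^{32} \right) \frac{n}{2^{32}} \right\rfloor + n}{4} \right\rfloor + 1 = \left\lceil \frac{n}{7} + \frac{5n + 1}{7 \cdot 2^{34}} \right\rceil.$$

The error term is nonpositive and greater than -1/7, so 
$q = \lceil n/7 \rceil$. 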


The multiply immediate can be done with a shift left of three and a 
subtract. 
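A C rendering of the divide-by-7 sequence shows the corrective add. This is a sketch with an illustrative function name; the magic number 0x92492493 is written as the signed value -1840700269, and `>>` on negative signed values is assumed to be an arithmetic shift.

```c
#include <stdint.h>

/* Signed division of n by 7. The magic number is
   (2**34 + 5)/7 - 2**32 = -1840700269 (bit pattern 0x92492493).
   Assumes arithmetic >> on negative signed values. */
int32_t div7(int32_t n) {
    const int32_t M = -1840700269;
    int32_t q = (int32_t)(((int64_t)M * n) >> 32);  /* mulhs q,M,n      */
    q += n;              /* Cannot overflow: q and n have opposite signs. */
    q >>= 2;             /* q = floor(q/4).                               */
    q += (int32_t)((uint32_t)n >> 31);  /* Add 1 to q if n is negative.   */
    return q;
}
```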


10-4 Signed Division by Divisors ≥ 2 


At this point you may wonder if other divisors present other problems. 
We see in this section that they do not; the three examples given 
illustrate the only cases that arise (for $d \ge 2$). 


Some of the proofs are a bit complicated, so to be cautious, the work 
is done in terms of a general word size W. 


Given a word size $W \ge 3$ and a divisor d, $2 \le d < 2^{W-1}$, we wish to 
find the least integer m and integer p such that 


Equation 1a 

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^{W-1}, \text{ and}$$


Equation 1b 

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } -2^{W-1} \le n \le -1,$$

with $0 \le m < 2^W$ and $p \ge W$. 

The reason we want the least integer m is that a smaller multiplier 
may give a smaller shift amount (possibly zero) or may yield code 
similar to the "divide by 5" example, rather than the "divide by 7" 
example. We must have $m \le 2^W - 1$ so the code has no more 
instructions than that of the "divide by 7" example (that is, we can 
handle a multiplier in the range $2^{W-1}$ to $2^W - 1$ by means of the add 
that was inserted in the "divide by 7" example, but we would rather 
not deal with larger multipliers). We must have $p \ge W$ because the 
generated code extracts the left half of the product mn, which is 
equivalent to shifting right W positions. Thus, the total right shift is W 
or more positions. 


There is a distinction between the multiplier m and the "magic 
number," denoted M. The magic number is the value used in the 
multiply instruction. It is given by 

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^{W-1}, \\ m - 2^W, & \text{if } 2^{W-1} \le m < 2^W. \end{cases}$$

Because (1b) must hold for n = -d, 
$\lfloor -md/2^p \rfloor + 1 = -1$, which implies 


Equation 2 

$$\frac{md}{2^p} > 1.$$


Let $n_c$ be the largest (positive) value of n such that 
$\operatorname{rem}(n_c, d) = d - 1$. $n_c$ exists because one possibility is 
$n_c = d - 1$. It can be calculated from 

$$n_c = \lfloor 2^{W-1}/d \rfloor d - 1 = 2^{W-1} - \operatorname{rem}(2^{W-1}, d) - 1.$$

$n_c$ is one of the highest d admissible values of n, so 


Equation 3a 

$$2^{W-1} - d \le n_c \le 2^{W-1} - 1,$$

and clearly 

Equation 3b 

$$n_c \ge d - 1.$$


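Because (1a) must hold for $n = n_c$, 

$$\left\lfloor \frac{mn_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$

or 

$$\frac{mn_c}{2^p} < \frac{n_c + 1}{d}.$$

Combining this with (2) gives 


Equation 4 

$$\frac{2^p}{d} < m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because m is to be the least integer satisfying (4), it is the next integer 
greater than $2^p/d$; that is, 


Equation 5 

$$m = \frac{2^p + d - \operatorname{rem}(2^p, d)}{d}.$$

Combining this with the right half of (4) and simplifying gives 


Equation 6 

$$2^p > n_c (d - \operatorname{rem}(2^p, d)).$$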


The Algorithm 


Thus, the algorithm to find the magic number M and the shift amount 
s from d is to first compute $n_c$, and then solve (6) for p by trying 
successively larger values. If p < W, set p = W (the theorem below 
shows that this value of p also satisfies (6)). When the smallest 
$p \ge W$ satisfying (6) is found, m is calculated from (5). This is the 
smallest possible value of m, because we found the smallest acceptable 
p, and from (4) clearly smaller values of p yield smaller values of m. 
Finally, s = p - W and M is simply a reinterpretation of m as a signed 
integer (which is how the mulhs instruction interprets it). 


Forcing p to be at least W is justified by the following: 


Theorem DC1. If (6) is true for some value of p, then it is true for all 
larger values of p. 


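Proof. Suppose (6) is true for $p = p_0$. Multiplying (6) by 2 gives 

$$2^{p_0 + 1} > n_c (2d - 2\operatorname{rem}(2^{p_0}, d)).$$

From Theorem D5, 
$\operatorname{rem}(2^{p_0 + 1}, d) \ge 2\operatorname{rem}(2^{p_0}, d) - d$. 

Combining gives 

$$2^{p_0 + 1} > n_c (2d - (\operatorname{rem}(2^{p_0 + 1}, d) + d)), \text{ or}$$

$$2^{p_0 + 1} > n_c (d - \operatorname{rem}(2^{p_0 + 1}, d)).$$

Therefore, (6) is true for $p = p_0 + 1$, and hence for all larger values. 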


Thus, one could solve (6) by a binary search, although a simple linear 
search (starting with p = W) is probably preferable, because usually d 
is small, and small values of d give small values of p. 
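The search can be sketched directly in C for W = 32 by evaluating (6) and (5) in 64-bit arithmetic. This is a straightforward illustration, not an optimized routine: the struct magic and the name magic_positive are assumptions introduced here, and d is assumed to satisfy $2 \le d \le 2^{31} - 1$.

```c
#include <stdint.h>

/* Illustrative result type (not from the text). */
struct magic {
    int32_t M;   /* Magic number, m reinterpreted as signed. */
    int s;       /* Shift amount, p - W. */
};

/* Magic number and shift for W = 32 and 2 <= d <= 2**31 - 1,
   found by a linear search on (6) followed by (5). */
struct magic magic_positive(uint32_t d) {
    const uint64_t two31 = (uint64_t)1 << 31;
    const uint64_t nc = two31 - two31 % d - 1;      /* n_c */
    int p = 32;                                     /* Force p >= W. */
    /* Advance p while (6), 2**p > nc*(d - rem(2**p, d)), fails. */
    while (((uint64_t)1 << p) <= nc * (d - ((uint64_t)1 << p) % d))
        p++;
    const uint64_t pow2 = (uint64_t)1 << p;
    const uint64_t m = (pow2 + d - pow2 % d) / d;   /* Equation (5). */
    struct magic r;
    /* M = m if m < 2**31, else m - 2**32. */
    r.M = (int32_t)(m >= two31 ? (int64_t)m - ((int64_t)1 << 32)
                               : (int64_t)m);
    r.s = p - 32;
    return r;
}
```

For d = 3, 5, and 7 this yields the magic numbers and shift amounts used in the three examples of Section 10-3.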


Proof That the Algorithm Is Feasible 


We must show that (6) always has a solution and that $0 \le m < 2^W$. 
(It is not necessary to show that $p \ge W$, because that is forced.) 


We show that (6) always has a solution by getting an upper bound on 
p. As a matter of general interest, we also derive a lower bound under 
the assumption that p is not forced to be at least W. To get these 
bounds on p, observe that for any positive integer x, there is a power 
of 2 greater than x and less than or equal to 2x. Hence from (6), 

$$n_c (d - \operatorname{rem}(2^p, d)) < 2^p \le 2 n_c (d - \operatorname{rem}(2^p, d)).$$

Because $0 \le \operatorname{rem}(2^p, d) \le d - 1$, 


Equation 7 

$$n_c + 1 \le 2^p \le 2 n_c d.$$

From (3a) and (3b), $n_c \ge \max(2^{W-1} - d, d - 1)$. The lines 
$f_1(d) = 2^{W-1} - d$ and $f_2(d) = d - 1$ cross at $d = (2^{W-1} + 1)/2$. 
Hence $n_c \ge (2^{W-1} - 1)/2$. Because $n_c$ is an integer, 
$n_c \ge 2^{W-2}$. Because $n_c, d \le 2^{W-1} - 1$, (7) becomes 

$$2^{W-2} + 1 \le 2^p \le 2(2^{W-1} - 1)^2,$$

or 

Equation 8 

$$W - 1 \le p \le 2W - 2.$$

The lower bound p = W - 1 can occur (e.g., for W = 32, d = 3), but 
in that case we set p = W. 


If p is not forced to equal W, then from (4) and (7), 

$$\frac{n_c + 1}{d} < m < \frac{2 n_c d}{d} \cdot \frac{n_c + 1}{n_c}.$$

Using (3b) gives 

$$\frac{d - 1 + 1}{d} < m < 2(n_c + 1).$$

Because $n_c \le 2^{W-1} - 1$ (3a), 

$$2 \le m \le 2^W - 1.$$

If p is forced to equal W, then from (4), 

$$\frac{2^W}{d} < m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $2 \le d \le 2^{W-1} - 1$ and $n_c \ge 2^{W-2}$, 

$$\frac{2^W}{2^{W-1} - 1} < m < \frac{2^W}{2} \cdot \frac{2^{W-2} + 1}{2^{W-2}}, \text{ or}$$

$$3 \le m \le 2^{W-1} + 1.$$

Hence in either case m is within limits for the code schema illustrated 
by the "divide by 7" example. 


Proof That the Product Is Correct 


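We must show that if p and m are calculated from (6) and (5), then 
equations (1a) and (1b) are satisfied. 


Equation (5) and inequality (6) are easily seen to imply (4). (In the 
case that p is forced to be equal to W, (6) still holds, as shown by 
Theorem DC1.) In what follows, we consider separately the following 
five ranges of values of n: 

$$0 \le n \le n_c,$$
$$n_c + 1 \le n \le n_c + d - 1,$$
$$-n_c \le n \le -1,$$
$$-n_c - d + 1 \le n \le -n_c - 1, \text{ and}$$
$$n = -n_c - d.$$

From (4), because m is an integer, 

$$\frac{2^p}{d} < m \le \frac{2^p (n_c + 1) - 1}{d n_c}.$$

Multiplying by $n/2^p$, for $n \ge 0$ this becomes 

$$\frac{n}{d} \le \frac{mn}{2^p} \le \frac{2^p n (n_c + 1) - n}{2^p d n_c}, \text{ so that}$$

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \left\lfloor \frac{n}{d} + \frac{(2^p - 1) n}{2^p d n_c} \right\rfloor.$$

For $0 \le n \le n_c$, $0 \le (2^p - 1)n/(2^p d n_c) < 1/d$, so by Theorem D4, 

$$\left\lfloor \frac{n}{d} + \frac{(2^p - 1) n}{2^p d n_c} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor.$$

Hence (1a) is satisfied in this case ($0 \le n \le n_c$). 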


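For $n > n_c$, n is limited to the range 


Equation 9 

$$n_c + 1 \le n \le n_c + d - 1,$$

because $n \ge n_c + d$ contradicts the choice of $n_c$ as the largest value 
of n such that $\operatorname{rem}(n_c, d) = d - 1$ (alternatively, from (3a), 
$n \ge n_c + d$ implies $n \ge 2^{W-1}$). From (4), for $n \ge 0$, 

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n}{d} \cdot \frac{n_c + 1}{n_c}.$$

By elementary algebra, this can be written 

Equation 10 

$$\frac{n}{d} < \frac{mn}{2^p} < \frac{n_c + 1}{d} + \frac{(n - n_c)(n_c + 1)}{d n_c}.$$

From (9), $1 \le n - n_c \le d - 1$, so 

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{n_c + 1}{n_c}.$$

Because $n_c \ge d - 1$ (by (3b)) and $(n_c + 1)/n_c$ has its maximum when 
$n_c$ has its minimum, 

$$0 < \frac{(n - n_c)(n_c + 1)}{d n_c} \le \frac{d - 1}{d} \cdot \frac{d - 1 + 1}{d - 1} = 1.$$

In (10), the term $(n_c + 1)/d$ is an integer. The term 
$(n - n_c)(n_c + 1)/(d n_c)$ is less than or equal to 1. Therefore, (10) 
becomes 

$$\left\lfloor \frac{n}{d} \right\rfloor \le \left\lfloor \frac{mn}{2^p} \right\rfloor \le \frac{n_c + 1}{d}.$$

For all n in the range (9), $\lfloor n/d \rfloor = (n_c + 1)/d$. Hence (1a) is 
satisfied in this case ($n_c + 1 \le n \le n_c + d - 1$). 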


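For n < 0, from (4) we have, because m is an integer, 

$$\frac{2^p + 1}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$

Multiplying by $n/2^p$, for n < 0 this becomes 

$$\frac{n}{d} \cdot \frac{n_c + 1}{n_c} < \frac{mn}{2^p} \le \frac{n}{d} \cdot \frac{2^p + 1}{2^p}, \text{ or}$$

$$\left\lfloor \frac{n}{d} \cdot \frac{n_c + 1}{n_c} \right\rfloor + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lfloor \frac{n}{d} \cdot \frac{2^p + 1}{2^p} \right\rfloor + 1.$$

Using Theorem D2 gives 

$$\left\lceil \frac{n(n_c + 1) - d n_c + 1}{d n_c} \right\rceil + 1 \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) - 2^p d + 1}{2^p d} \right\rceil + 1,$$

$$\left\lceil \frac{n(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n(2^p + 1) + 1}{2^p d} \right\rceil.$$

Because $n + 1 \le 0$, the right inequality can be weakened, giving 


Equation 11 

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For $-n_c \le n \le -1$, 

$$\frac{-n_c + 1}{d n_c} \le \frac{n + 1}{d n_c} \le 0, \text{ or}$$

$$-\frac{1}{d} < \frac{n + 1}{d n_c} \le 0.$$

Hence by Theorem D4, 

$$\left\lceil \frac{n}{d} + \frac{n + 1}{d n_c} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied in this case ($-n_c \le n \le -1$). 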


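For $n < -n_c$, n is limited to the range 


Equation 12 

$$-n_c - d \le n \le -n_c - 1.$$

(From (3a), $n < -n_c - d$ implies that $n < -2^{W-1}$, which is impossible.) 
Performing elementary algebraic manipulation of the left comparand 
of (11) gives 


Equation 13 

$$\left\lceil \frac{-n_c - 1}{d} + \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \right\rceil \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For $-n_c - d + 1 \le n \le -n_c - 1$, 

$$\frac{(-d + 1)(n_c + 1)}{d n_c} + \frac{1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} \le \frac{-(n_c + 1) + 1}{d n_c} = -\frac{1}{d}.$$

The ratio $(n_c + 1)/n_c$ is a maximum when $n_c$ is a minimum; that is, 
$n_c = d - 1$. Therefore, 

$$\frac{(-d + 1)(d - 1 + 1)}{d(d - 1)} + \frac{1}{d n_c} \le \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0,$$

or 

$$-1 < \frac{(n + n_c)(n_c + 1) + 1}{d n_c} < 0.$$

From (13), because $(-n_c - 1)/d$ is an integer and the quantity added to 
it is between 0 and -1, 

$$\frac{-n_c - 1}{d} \le \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \le \left\lceil \frac{n}{d} \right\rceil.$$

For n in the range $-n_c - d + 1 \le n \le -n_c - 1$, 

$$\left\lceil \frac{n}{d} \right\rceil = \frac{-n_c - 1}{d}.$$

Hence $\lfloor mn/2^p \rfloor + 1 = \lceil n/d \rceil$; that is, (1b) is satisfied. 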


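The last case, $n = -n_c - d$, can occur only for certain values of d. From 
(3a), $-n_c - d \le -2^{W-1}$, so if n takes on this value, we must have 
$n = -n_c - d = -2^{W-1}$, and hence $n_c = 2^{W-1} - d$. Therefore, 
$\operatorname{rem}(2^{W-1}, d) = \operatorname{rem}(n_c + d, d) = d - 1$ (that is, 
d divides $2^{W-1} + 1$). 


For this case ($n = -n_c - d$), (6) has the solution p = W - 1 (the smallest 
possible value of p), because for p = W - 1, 

$$n_c (d - \operatorname{rem}(2^p, d)) = (2^{W-1} - d)(d - \operatorname{rem}(2^{W-1}, d))$$
$$= (2^{W-1} - d)(d - (d - 1)) = 2^{W-1} - d < 2^{W-1} = 2^p.$$

Then from (5), 

$$m = \frac{2^{W-1} + d - \operatorname{rem}(2^{W-1}, d)}{d} = \frac{2^{W-1} + d - (d - 1)}{d} = \frac{2^{W-1} + 1}{d}.$$

Therefore, 

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lfloor \frac{2^{W-1} + 1}{d} \cdot \frac{-2^{W-1}}{2^{W-1}} \right\rfloor + 1 = \left\lfloor \frac{-2^{W-1} - 1}{d} \right\rfloor + 1$$
$$= \left\lceil \frac{-2^{W-1} - d}{d} \right\rceil + 1 = \left\lceil \frac{-2^{W-1}}{d} \right\rceil = \left\lceil \frac{n}{d} \right\rceil,$$

so that (1b) is satisfied. 


This completes the proof that if m and p are calculated from (5) and 
(6), then Equations (1a) and (1b) hold for all admissible values of n. 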


10-5 Signed Division by Divisors ≤ -2 


Because signed integer division satisfies $n \div (-d) = -(n \div d)$, it is 
adequate to generate code for $n \div |d|$ and follow it with an 
instruction to negate the quotient. (This does not give the correct 
result for $d = -2^{W-1}$, but for this and other negative powers of 2, you 
can use the code in Section 10-1, "Signed Division by a Known Power 
of 2," on page 155, followed by a negating instruction.) It will not do 
to negate the dividend, because of the possibility that it is the 
maximum negative number. 


It is possible, however, to avoid the negating instruction. The scheme 
is to compute 

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor \quad \text{if } n \le 0, \text{ and}$$

$$q = \left\lfloor \frac{mn}{2^p} \right\rfloor + 1 \quad \text{if } n > 0.$$

Adding 1 if n > 0, however, is awkward (because one cannot simply 
use the sign bit of n), so the code will instead add 1 if q < 0. This is 
equivalent because the multiplier m is negative (as will be seen). 


The code to be generated is illustrated below for the case W = 32, d 
= -7. 


li    M,0x6DB6DB6D  Magic num, -(2**34+5)/7 + 2**32.
mulhs q,M,n         q = floor(M*n/2**32).
sub   q,q,n         q = floor(M*n/2**32) - n.
shrsi q,q,2         q = floor(q/4).
shri  t,q,31        Add 1 to q if
add   q,q,t         q is negative (n is positive).
muli  t,q,-7        Compute remainder from
sub   r,n,t         r = n - q*(-7).


This code is the same as that for division by +7, except that it uses 
the negative of the multiplier for +7, and a sub rather than an add 
after the multiply, and the shri of 31 must use q rather than n, as 
discussed above. (The case of d = +7 could also use q here, but there 
would be less parallelism in the code.) The subtract will not overflow 
because the operands have the same sign. This scheme, however, does 
not always work! Although the code above for W = 32, d = -7 is 
correct, the analogous alteration of the "divide by 3" code to produce 
code to divide by -3 does not give the correct result for W = 32, 
$n = -2^{31}$. 
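The d = -7 sequence can be written in C in the same style as the earlier examples. This is a sketch: the name div_neg7 is illustrative, and `>>` on negative signed values is assumed to be an arithmetic shift (implementation-defined in C).

```c
#include <stdint.h>

/* Signed division of n by -7, using the magic number
   0x6DB6DB6D = -(2**34 + 5)/7 + 2**32.
   Assumes arithmetic >> on negative signed values. */
int32_t div_neg7(int32_t n) {
    int32_t q = (int32_t)(((int64_t)0x6DB6DB6D * n) >> 32);  /* mulhs */
    q -= n;              /* Cannot overflow: operands have the same sign. */
    q >>= 2;             /* q = floor(q/4).                               */
    q += (int32_t)((uint32_t)q >> 31);  /* Add 1 to q if q is negative.   */
    return q;
}
```

Note that the final correction tests the sign of q, not of n, as discussed above.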


Let us look at the situation more closely. 


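Given a word size $W \ge 3$ and a divisor d, $-2^{W-1} \le d \le -2$, we wish 
to find the least (in absolute value) integer m and integer p such that 


Equation 14a 

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } -2^{W-1} \le n \le 0, \text{ and}$$


Equation 14b 

$$\left\lfloor \frac{mn}{2^p} \right\rfloor + 1 = \left\lceil \frac{n}{d} \right\rceil \quad \text{for } 1 \le n < 2^{W-1},$$

with $-2^W \le m \le 0$ and $p \ge W$. 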


Proceeding similarly to the case of division by a positive divisor, let 
$n_c$ be the most negative value of n such that $n_c = kd + 1$ for some 
integer k. $n_c$ exists because one possibility is $n_c = d + 1$. It can be 
calculated from 

$$n_c = \lfloor (-2^{W-1} - 1)/d \rfloor d + 1 = -2^{W-1} + \operatorname{rem}(2^{W-1} + 1, d).$$

$n_c$ is one of the least |d| admissible values of n, so 


Equation 15a 

$$-2^{W-1} \le n_c \le -2^{W-1} - d - 1,$$

and clearly 


Equation 15b 

$$n_c \le d + 1.$$

Because (14b) must hold for n = -d, and (14a) must hold for $n = n_c$, 
we obtain, analogous to (4), 


Equation 16 

$$\frac{2^p}{d} \cdot \frac{n_c - 1}{n_c} < m < \frac{2^p}{d}.$$

Because m is to be the greatest integer satisfying (16), it is the next 
integer less than $2^p/d$; that is, 


Equation 17 

$$m = \frac{2^p - d - \operatorname{rem}(2^p, d)}{d}.$$

Combining this with the left half of (16) and simplifying gives 


Equation 18 

$$2^p > n_c (d + \operatorname{rem}(2^p, d)).$$


The proof that the algorithm suggested by (17) and (18) is feasible 
and that the product is correct is similar to that for a positive divisor, 
and will not be repeated. A difficulty arises, however, in trying to 
prove that $-2^W \le m \le 0$. To prove this, consider separately the cases 
in which d is the negative of a power of 2, or some other number. For 
$d = -2^k$ it is easy to show that $n_c = -2^{W-1} + 1$, $p = W + k - 1$, and 
$m = -2^{W-1} - 1$ (which is within range). For d not of the form $-2^k$, it is 
straightforward to alter the earlier proof. 


For Which Divisors Is m(-d) ≠ -m(d)? 


By m(d) we mean the multiplier corresponding to a divisor d. If m(-d) 
= -m(d), code for division by a negative divisor can be generated by 
calculating the multiplier for |d|, negating it, and then generating 
code similar to that of the "divide by 7" case illustrated above. 


By comparing (18) with (6) and (17) with (5), it can be seen that if 
the value of $n_c$ for -d is the negative of that for d, then m(-d) = -m(d). 
Hence m(-d) ≠ -m(d) can occur only when the value of $n_c$ calculated 
for the negative divisor is the maximum negative number, $-2^{W-1}$. 
Such divisors are the negatives of the factors of $2^{W-1} + 1$. These 
numbers are fairly rare, as illustrated by the factorings below 
(obtained from Scratchpad). 

$$2^{15} + 1 = 3^2 \cdot 11 \cdot 331$$
$$2^{31} + 1 = 3 \cdot 715{,}827{,}883$$
$$2^{63} + 1 = 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77{,}158{,}673{,}929$$

For all these factors, m(-d) ≠ -m(d). Proof sketch: For d > 0 we have 
$n_c = 2^{W-1} - d$. Because $\operatorname{rem}(2^{W-1}, d) = d - 1$, (6) is 
satisfied by p = W - 1 and hence also by p = W. For d < 0, however, we 
have $n_c = -2^{W-1}$ and $\operatorname{rem}(2^{W-1}, d) = |d| - 1$. Hence (18) 
is not satisfied for p = W - 1 or for p = W, so p > W. 


10-6 Incorporation into a Compiler 


For a compiler to change division by a constant into a multiplication, 
it must compute the magic number M and the shift amount s, given a 
divisor d. The straightforward computation is to evaluate (6) or (18) 
for p = W, W + 1,... until it is satisfied. Then, m is calculated from 
(5) or (17). M is simply a reinterpretation of m as a signed integer,
and s = p - W.


The scheme described below handles positive and negative d with 
only a little extra code, and it avoids doubleword arithmetic. 


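Recall that $n_c$ is given by

$$n_c = \begin{cases} 2^{W-1} - \mathrm{rem}(2^{W-1}, d) - 1, & \text{if } d > 0, \\ -2^{W-1} + \mathrm{rem}(2^{W-1} + 1, d), & \text{if } d < 0. \end{cases}$$


Hence $|n_c|$ can be computed from

$$t = 2^{W-1} + \begin{cases} 0, & \text{if } d > 0, \\ 1, & \text{if } d < 0, \end{cases}$$

$$|n_c| = t - 1 - \mathrm{rem}(t, |d|).$$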


The remainder must be evaluated using unsigned division, because of 
the magnitude of the arguments. We have written rem(t, |d|) rather 
than the equivalent rem(t, d), to emphasize that the program must 
deal with two positive (and unsigned) arguments. 


From (6) and (18), p can be calculated from


Equation 19

$$2^p > |n_c|\,(|d| - \mathrm{rem}(2^p, |d|)),$$


and then |m| can be calculated from (cf. (5) and (17)):


Equation 20

$$|m| = \frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|}.$$
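For comparison with the W-bit scheme developed next, here is a minimal sketch that evaluates (19) and (20) directly for W = 32 by leaning on 64-bit arithmetic as the "doubleword". The names `magic_direct` and `struct ms_direct` are hypothetical, introduced only for this illustration; the sketch also assumes the usual two's-complement wraparound when converting an out-of-range unsigned value to `int32_t`.

```c
#include <assert.h>
#include <stdint.h>

struct ms_direct { int32_t M; int s; };   /* hypothetical result type */

/* Direct evaluation of (19) and (20) for W = 32, using 64-bit arithmetic. */
struct ms_direct magic_direct(int32_t d) {
    assert(d >= 2 || d <= -2);
    uint64_t ad  = (d < 0) ? (uint64_t)(-(int64_t)d) : (uint64_t)d; /* |d|  */
    uint64_t t   = (UINT64_C(1) << 31) + (d < 0 ? 1u : 0u);
    uint64_t anc = t - 1 - t % ad;                                  /* |nc| */
    int p = 32;
    for (;;) {
        uint64_t pow2 = UINT64_C(1) << p;
        if (pow2 > anc * (ad - pow2 % ad)) break;   /* inequality (19) holds */
        p++;
        assert(p <= 62);                            /* bound on p for W = 32 */
    }
    uint64_t pow2 = UINT64_C(1) << p;
    uint64_t am   = (pow2 + ad - pow2 % ad) / ad;   /* |m| from (20) */
    uint32_t bits = (uint32_t)am;                   /* keep the low 32 bits */
    if (d < 0) bits = 0u - bits;                    /* m = -|m| for d < 0 */
    struct ms_direct r = { (int32_t)bits, p - 32 }; /* M reinterprets m */
    return r;
}
```

For example, `magic_direct(7)` yields M = 0x92492493 (as a signed value) with s = 2, and `magic_direct(-3)` yields M = 0x55555555 with s = 1, one of the divisors for which m(-d) differs from -m(d).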


Direct evaluation of $\mathrm{rem}(2^p, |d|)$ in (19) requires "long division"
(dividing a 2W-bit dividend by a W-bit divisor, giving a W-bit quotient
and remainder), and in fact it must be unsigned long division. There is
a way to solve (19), and in fact to do all the calculations, that avoids
long division and can easily be implemented in a conventional HLL
using only W-bit arithmetic. We do, however, need unsigned division
and unsigned comparisons.


We can calculate $\mathrm{rem}(2^p, |d|)$ incrementally, by initializing two
variables q and r to the quotient and remainder of $2^p$ divided by |d|
with $p = W - 1$, and then updating q and r as p increases.


As the search progresses (that is, when p is incremented by 1), q and
r are updated from (see Theorem D5(a))


   q = 2*q;
   r = 2*r;
   if (r >= abs(d)) {
      q = q + 1;
      r = r - abs(d);}


The left half of inequality (4) and the right half of (16), together with
the bounds proved for m, imply that $q = \lfloor 2^p/|d| \rfloor < 2^W$, so q is
representable as a W-bit unsigned integer. Also, $0 \le r < |d|$, so r is
representable as a W-bit signed or unsigned integer. (Caution: The
intermediate result 2r can exceed $2^{W-1} - 1$, so r should be unsigned
and the comparison above should also be unsigned.)
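The incremental update can be sketched in isolation as follows, for W = 32. The function name `pow2_divmod` is hypothetical; it starts from p = 31 and doubles up to a requested p, and it assumes the caller picks p small enough that the quotient still fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Computes q = floor(2**p / ad) and r = rem(2**p, ad) using only 32-bit
   unsigned operations, by doubling from p = 31 upward. */
void pow2_divmod(uint32_t ad, int p, uint32_t *q, uint32_t *r) {
    assert(ad >= 2 && ad <= 0x80000000u && p >= 31);
    uint32_t qq = 0x80000000u / ad;        /* floor(2**31 / ad) */
    uint32_t rr = 0x80000000u - qq * ad;   /* rem(2**31, ad)    */
    for (int i = 31; i < p; i++) {
        qq = 2 * qq;
        rr = 2 * rr;                       /* rr < ad <= 2**31, so no wrap */
        if (rr >= ad) {                    /* unsigned comparison */
            qq = qq + 1;
            rr = rr - ad;
        }
    }
    *q = qq;
    *r = rr;
}
```

With ad = 7 and p = 34 this gives q = 2,454,267,026 and r = 2, agreeing with $2^{34} = 7 \cdot 2{,}454{,}267{,}026 + 2$.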


Next, calculate $\delta = |d| - r$. Both terms of the subtraction are
representable as W-bit unsigned integers, and the result is also
($1 \le \delta \le |d|$), so there is no difficulty here.


To avoid the long multiplication of (19), rewrite it as

$$\frac{2^p}{|n_c|} > \delta.$$

The quantity $2^p/|n_c|$ is representable as a W-bit unsigned integer
(similarly to (7), from (19) it can be shown that $2^p \le 2|n_c| \cdot |d|$ and,
for $d = -2^{W-1}$, $n_c = -2^{W-1} + 1$ and $p = 2W - 2$, so that

$$\frac{2^p}{|n_c|} = \frac{2^{2W-2}}{2^{W-1} - 1} < 2^W$$

for $W \ge 3$). Also, it is easily
calculated incrementally (as p increases) in the same manner as for
$\mathrm{rem}(2^p, |d|)$. The comparison should be unsigned, for the case
$2^p/|n_c| \ge 2^{W-1}$ (which can occur, for large d).

To compute m, we need not evaluate (20) directly (which would
require long division). Observe that

$$\frac{2^p + |d| - \mathrm{rem}(2^p, |d|)}{|d|} = \left\lfloor \frac{2^p}{|d|} \right\rfloor + 1 = q + 1.$$


The loop closure test $2^p/|n_c| > \delta$ is awkward to evaluate. The
quantity $2^p/|n_c|$ is available only in the form of a quotient $q_1$ and a
remainder $r_1$. $2^p/|n_c|$ may or may not be an integer (it is an integer
only for $d = 2^{W-2} + 1$ and a few negative values of d). The test
$2^p/|n_c| \le \delta$ may be coded as

$$q_1 < \delta \;|\; (q_1 = \delta \;\&\; r_1 = 0).$$


The complete procedure for computing M and s from d is shown in
Figure 10-1, coded in C, for W = 32. There are a few places where
overflow can occur, but the correct result is obtained if overflow is
ignored.


Figure 10-1 Computing the magic number for signed division.


#include <stdlib.h>         // For abs().

struct ms {int M;          // Magic number
           int s;};        // and shift amount.

struct ms magic(int d) {   // Must have 2 <= d <= 2**31-1
                           // or   -2**31 <= d <= -2.
   int p;
   unsigned ad, anc, delta, q1, r1, q2, r2, t;
   const unsigned two31 = 0x80000000;     // 2**31.
   struct ms mag;

   ad = abs(d);
   t = two31 + ((unsigned)d >> 31);
   anc = t - 1 - t%ad;     // Absolute value of nc.
   p = 31;                 // Init. p.
   q1 = two31/anc;         // Init. q1 = 2**p/|nc|.
   r1 = two31 - q1*anc;    // Init. r1 = rem(2**p, |nc|).
   q2 = two31/ad;          // Init. q2 = 2**p/|d|.
   r2 = two31 - q2*ad;     // Init. r2 = rem(2**p, |d|).
   do {
      p = p + 1;
      q1 = 2*q1;           // Update q1 = 2**p/|nc|.
      r1 = 2*r1;           // Update r1 = rem(2**p, |nc|).
      if (r1 >= anc) {     // (Must be an unsigned
         q1 = q1 + 1;      // comparison here).
         r1 = r1 - anc;}
      q2 = 2*q2;           // Update q2 = 2**p/|d|.
      r2 = 2*r2;           // Update r2 = rem(2**p, |d|).
      if (r2 >= ad) {      // (Must be an unsigned
         q2 = q2 + 1;      // comparison here).
         r2 = r2 - ad;}
      delta = ad - r2;
   } while (q1 < delta || (q1 == delta && r1 == 0));

   mag.M = q2 + 1;
   if (d < 0) mag.M = -mag.M;   // Magic number and
   mag.s = p - 32;              // shift amount to return.
   return mag;
}


To use the results of this program, the compiler should generate the
li and mulhs instructions, generate the add if d > 0 and M < 0,
or the sub if d < 0 and M > 0, and generate the shrsi if s > 0.
Then, the shri and final add must be generated.
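The instruction selection just described can be mimicked at run time. The following is only a sketch for W = 32: `divide_by_magic` is a hypothetical helper, mulhs is emulated with a 64-bit product, and an arithmetic right shift of negative values is assumed (true of common compilers, but implementation-defined in C). A compiler would resolve the two `if` tests once, when generating code, rather than at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating signed division of n by the constant d, given the magic
   number M and shift amount s computed for d. */
int32_t divide_by_magic(int32_t n, int32_t d, int32_t M, int s) {
    assert(s >= 0 && s < 32);
    int32_t q = (int32_t)(((int64_t)M * n) >> 32);  /* mulhs q,M,n */
    if (d > 0 && M < 0) q += n;                     /* add  q,q,n  */
    if (d < 0 && M > 0) q -= n;                     /* sub  q,q,n  */
    q >>= s;                                        /* shrsi q,q,s */
    q += (int32_t)((uint32_t)q >> 31);              /* add 1 if q < 0 */
    return q;
}
```

For instance, with the values for d = 7 (M = 0x92492493 taken as a signed number, s = 2), dividends 100 and -100 give 14 and -14.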


For W = 32, handling a negative divisor may be avoided by simply
returning a precomputed result for d = 3 and d = 715,827,883, and
using m(-d) = -m(d) for other negative divisors. However, that
program would not be significantly shorter, if at all, than the one
given in Figure 10-1.


10-7 Miscellaneous Topics 


Theorem DC2. The least multiplier m is odd if p is not forced to equal 
W. 


Proof. Assume that Equations (1a) and (1b) are satisfied with least 
(not forced) integer p, and m even. Then clearly m could be divided 
by 2 and p could be decreased by 1, and (1a) and (1b) would still be 
satisfied. This contradicts the assumption that p is minimal. 


Uniqueness 


The magic number for a given divisor is sometimes unique (e.g., for W
= 32, d = 7), but often it is not. In fact, experimentation indicates
that it is usually not unique. For example, for W = 32, d = 6, there
are four magic numbers:


M = 715,827,883      ($(2^{32} + 2)/6$),            s = 0
M = 1,431,655,766    ($(2^{32} + 2)/3$),            s = 1
M = -1,431,655,765   ($(2^{33} + 1)/3 - 2^{32}$),   s = 2
M = -1,431,655,764   ($(2^{33} + 4)/3 - 2^{32}$),   s = 2

However, there is the following uniqueness property: 


Theorem DC3. For a given divisor d, there is only one multiplier m 
having the minimal value of p, if p is not forced to equal W. 


Proof. First consider the case d > 0. The difference between the upper
and lower limits of inequality (4) is $2^p/(d n_c)$. We have already proved
(7) that if p is minimal, then $2^p/(d n_c) \le 2$. Therefore, there can be at
most two values of m satisfying (4). Let m be the smaller of these
values, given by (5); then m + 1 is the other.


Let $p_0$ be the least value of p for which m + 1 satisfies the right half
of (4) ($p_0$ is not forced to equal W). Then

$$\frac{2^{p_0} + d - \mathrm{rem}(2^{p_0}, d)}{d} + 1 < \frac{2^{p_0}}{d} \cdot \frac{n_c + 1}{n_c}.$$


This simplifies to

$$2^{p_0} > n_c\,(2d - \mathrm{rem}(2^{p_0}, d)).$$

Dividing by 2 gives

$$2^{p_0 - 1} > n_c\left(d - \tfrac{1}{2}\,\mathrm{rem}(2^{p_0}, d)\right).$$


Because $\mathrm{rem}(2^{p_0}, d) \le 2\,\mathrm{rem}(2^{p_0 - 1}, d)$ (by Theorem D5),

$$2^{p_0 - 1} > n_c\,(d - \mathrm{rem}(2^{p_0 - 1}, d)),$$


contradicting the assumption that $p_0$ is minimal.


The proof for d < 0 is similar and will not be given.


The Divisors with the Best Programs


The program for d = 3, W = 32 is particularly short, because there is
no add or shrsi after the mulhs. What other divisors have this
short program?


We consider only positive divisors. We wish to find integers m and p
that satisfy equations (1a) and (1b), and for which p = W and
$0 \le m < 2^{W-1}$. Because any integers m and p that satisfy equations (1a) and
(1b) must also satisfy (4), it suffices to find those divisors d for which
(4) has a solution with p = W and $0 \le m < 2^{W-1}$. All solutions of (4)
with p = W are given by

$$m = \frac{2^W + kd - \mathrm{rem}(2^W, d)}{d}, \quad k = 1, 2, 3, \ldots$$


Combining this with the right half of (4) and simplifying gives

Equation 21

$$\mathrm{rem}(2^W, d) > kd - \frac{2^W}{n_c}.$$


The weakest restriction on $\mathrm{rem}(2^W, d)$ is with k = 1 and $n_c$ at its
minimal value of $2^{W-2}$. Hence we must have

$$\mathrm{rem}(2^W, d) > d - 4;$$


that is, d divides $2^W + 1$, $2^W + 2$, or $2^W + 3$.

Now let us see which of these factors actually have optimal programs.


If d divides $2^W + 1$, then $\mathrm{rem}(2^W, d) = d - 1$. Then a solution of (6) is
p = W, because the inequality becomes

$$2^W > n_c\,(d - (d - 1)) = n_c,$$


which is obviously true, because $n_c < 2^{W-1}$. Then in the calculation
of m we have

$$m = \frac{2^W + d - (d - 1)}{d} = \frac{2^W + 1}{d},$$


which is less than $2^{W-1}$ for $d \ge 3$ ($d \ne 2$ because d divides $2^W + 1$).
Hence all the factors of $2^W + 1$ have optimal programs.


Similarly, if d divides $2^W + 2$, then $\mathrm{rem}(2^W, d) = d - 2$. Again, a
solution of (6) is p = W, because the inequality becomes

$$2^W > n_c\,(d - (d - 2)) = 2n_c,$$


which is obviously true. Then in the calculation of m we have

$$m = \frac{2^W + d - (d - 2)}{d} = \frac{2^W + 2}{d},$$


which exceeds $2^{W-1} - 1$ for d = 2, but which is less than or equal to
$2^{W-1} - 1$ for $W \ge 3$, $d \ge 3$ (the case W = 3 and d = 3 does not
occur, because 3 is not a factor of $2^3 + 2 = 10$). Hence all factors of
$2^W + 2$, except for 2 and the cofactor of 2, have optimal programs.
(The cofactor of 2 is $(2^W + 2)/2$, which is not representable as a
W-bit signed integer).


If d divides $2^W + 3$, the following argument shows that d does not
have an optimal program. Because $\mathrm{rem}(2^W, d) = d - 3$, inequality (21)
implies that we must have

$$n_c < \frac{2^W}{kd - d + 3}$$


for some k = 1, 2, 3, .... The weakest restriction is with k = 1, so we
must have $n_c < 2^W/3$.


From (3a), $n_c \ge 2^{W-1} - d$, or $d \ge 2^{W-1} - n_c$. Hence it is necessary
that

$$d > 2^{W-1} - \frac{2^W}{3} = \frac{2^W}{6}.$$


Also, since 2, 3, and 4 do not divide $2^W + 3$, the smallest possible
factor of $2^W + 3$ is 5. Hence the largest possible factor is $(2^W + 3)/5$.
Thus, if d divides $2^W + 3$ and d has an optimal program, it is
necessary that

$$\frac{2^W}{6} < d \le \frac{2^W + 3}{5}.$$


Taking reciprocals of this with respect to $2^W + 3$ shows that the
cofactor of d, $(2^W + 3)/d$, has the limits

$$5 \le \frac{2^W + 3}{d} < \frac{(2^W + 3) \cdot 6}{2^W} = 6 + \frac{18}{2^W}.$$


For $W \ge 5$, this implies that the only possible cofactors are 5 and 6.
For W < 5, it is easily verified that there are no factors of $2^W + 3$.
Because 6 cannot be a factor of $2^W + 3$, the only possibility is 5.
Therefore, the only possible factor of $2^W + 3$ that might have an
optimal program is $(2^W + 3)/5$.


For $d = (2^W + 3)/5$,

$$n_c = \left\lfloor \frac{2^{W-1}}{(2^W + 3)/5} \right\rfloor \left( \frac{2^W + 3}{5} \right) - 1.$$


For $W \ge 4$,

$$2 \le \frac{2^{W-1}}{(2^W + 3)/5} < 2.5,$$

so

$$n_c = 2\left( \frac{2^W + 3}{5} \right) - 1.$$


This exceeds $2^W/3$, so $d = (2^W + 3)/5$ does not have an optimal
program. Because for W < 4 there are no factors of $2^W + 3$, we
conclude that no factors of $2^W + 3$ have optimal programs.


In summary, all the factors of $2^W + 1$ and of $2^W + 2$, except for 2 and
$(2^W + 2)/2$, have optimal programs, and no other numbers do.


Furthermore, the above proof shows that algorithm magic (Figure 
10-1 on page 174) always produces the optimal program when it 
exists. 


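Let us consider the specific cases W = 16, 32, and 64. The relevant
factorizations are shown below.


$2^{16} + 1 = 65537$ (prime)
$2^{16} + 2 = 2 \cdot 3^2 \cdot 11 \cdot 331$
$2^{32} + 1 = 641 \cdot 6{,}700{,}417$
$2^{32} + 2 = 2 \cdot 3 \cdot 715{,}827{,}883$
$2^{64} + 1 = 274{,}177 \cdot 67{,}280{,}421{,}310{,}721$
$2^{64} + 2 = 2 \cdot 3^3 \cdot 19 \cdot 43 \cdot 5419 \cdot 77{,}158{,}673{,}929$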

Hence we have the results that for W = 16, there are 20 divisors that 
have optimal programs. The ones less than 100 are 3, 6, 9, 11, 18, 22, 
33, 66, and 99. 


For W = 32, there are six such divisors: 3, 6, 641, 6,700,417, 
715,827,883, and 1,431,655,766. 


For W = 64, there are 126 such divisors. The ones less than 100 are 3,
6, 9, 18, 19, 27, 38, 43, 54, 57, and 86.
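The count for W = 16 can be checked by brute force from the summary above. This is a rough sketch only: `count_optimal_divisors` is a hypothetical name, and the loop is practical only for small W because it tries every candidate divisor.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the divisors d, 3 <= d < 2**(W-1), that divide 2**W + 1 or
   2**W + 2.  Starting at 3 excludes d = 2, and the upper limit excludes
   the cofactor (2**W + 2)/2, which is not a W-bit signed integer. */
int count_optimal_divisors(int W) {
    assert(W >= 3 && W <= 24);             /* keep the loop short */
    uint64_t a = (UINT64_C(1) << W) + 1;   /* 2**W + 1 */
    uint64_t b = a + 1;                    /* 2**W + 2 */
    uint64_t limit = UINT64_C(1) << (W - 1);
    int count = 0;
    for (uint64_t d = 3; d < limit; d++)
        if (a % d == 0 || b % d == 0)
            count++;
    return count;
}
```

Calling `count_optimal_divisors(16)` returns 20, matching the figure quoted for W = 16.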


10-8 Unsigned Division 


Unsigned division by a power of 2 is of course implemented by a 
single shift right logical instruction, and remainder by and immediate. 


It might seem that handling other divisors will be simple: Just use the 
results for signed division with d > 0, omitting the two instructions 
that add 1 if the quotient is negative. We will see, however, that some 
of the details are actually more complicated in the case of unsigned 
division. 


Unsigned Divide by 3 


For a non-power of 2, let us first consider unsigned division by 3 on a
32-bit machine. Because the dividend n can now be as large as $2^{32} - 1$,
the multiplier $(2^{32} + 2)/3$ is inadequate, because the error term
$2n/(3 \cdot 2^{32})$ (see "divide by 3" example above) can exceed 1/3. However, the
multiplier $(2^{33} + 1)/3$ is adequate. The code is


   li    M,0xAAAAAAAB   Load magic number, (2**33+1)/3.
   mulhu q,M,n          q = floor(M*n/2**32).
   shri  q,q,1

   muli  t,q,3          Compute remainder from
   sub   r,n,t          r = n - q*3.


An instruction that gives the high-order 32 bits of a 64-bit unsigned 
product is required, which we show above as mulhu. 


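To see that the code is correct, observe that it computes

$$q = \left\lfloor \frac{2^{33} + 1}{3} \cdot \frac{n}{2^{33}} \right\rfloor = \left\lfloor \frac{n}{3} + \frac{n}{3 \cdot 2^{33}} \right\rfloor.$$


For $0 \le n < 2^{32}$, $0 \le n/(3 \cdot 2^{33}) < 1/3$, so by Theorem D4,
$q = \lfloor n/3 \rfloor$.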


In computing the remainder, the multiply immediate can overflow if 
we regard the operands as signed integers, but it does not overflow if 
we regard them and the result as unsigned. Also, the subtract cannot 
overflow, because the result is in the range 0 to 2, so the remainder is 
correct. 
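In C, the same computation can be sketched with a 64-bit product standing in for mulhu. The helper names `mulhu`, `divu3`, and `remu3` are illustrative choices, not part of the original listing.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product a*b. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n/3: multiply by (2**33 + 1)/3, then shift right 33 in all. */
uint32_t divu3(uint32_t n) {
    return mulhu(0xAAAAAAABu, n) >> 1;
}

/* Unsigned remainder; q*3 cannot wrap because q*3 <= n. */
uint32_t remu3(uint32_t n) {
    uint32_t r = n - divu3(n) * 3u;
    assert(r <= 2);
    return r;
}
```

For the largest dividend, `divu3(0xFFFFFFFF)` gives 1,431,655,765 with remainder 0.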


Unsigned Divide by 7 


For unsigned division by 7 on a 32-bit machine, the multipliers
$(2^{32} + 3)/7$, $(2^{33} + 6)/7$, and $(2^{34} + 5)/7$ are all inadequate because they
give too large an error term. The multiplier $(2^{35} + 3)/7$ is acceptable,
but it's too large to represent in a 32-bit unsigned word. We can
multiply by this large number by multiplying by $(2^{35} + 3)/7 - 2^{32}$ and
then correcting the product by inserting an add. The code is


   li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
   mulhu q,M,n          q = floor(M*n/2**32).
   add   q,q,n          Can overflow (sets carry).
   shrxi q,q,3          Shift right with carry bit.

   muli  t,q,7          Compute remainder from
   sub   r,n,t          r = n - q*7.


Here we have a problem: The add can overflow. To allow for this, we 
have invented the new instruction shift right extended immediate 
(shrxi), which treats the carry from the add and the 32 bits of 
register q as a single 33-bit quantity, and shifts it right with 0-fill. On 
the Motorola 68000 family, this can be done with two instructions: 
rotate with extend right one position, followed by a logical right shift 
of three (roxr actually uses the X bit, but the add sets the X bit the 
same as the carry bit). On most machines, it will take more. For 
example, on PowerPC it takes three instructions: clear rightmost three 
bits of q, add carry to q, and rotate right three positions. 


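With shrxi implemented somehow, the code above computes

$$q = \left\lfloor \frac{2^{35} + 3}{7} \cdot \frac{n}{2^{35}} \right\rfloor = \left\lfloor \frac{n}{7} + \frac{3n}{7 \cdot 2^{35}} \right\rfloor.$$

For $0 \le n < 2^{32}$, $0 \le 3n/(7 \cdot 2^{35}) < 1/7$, so by Theorem D4,
$q = \lfloor n/7 \rfloor$.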


Granlund and Montgomery [GM] have a clever scheme for avoiding 
the shrxi instruction. It requires the same number of instructions as 
the above three-instruction sequence for shrxi, but it employs only 
elementary instructions that almost any machine would have, and it 
does not cause overflow at all. It uses the identity 


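$$\left\lfloor \frac{q + n}{2^p} \right\rfloor = \left\lfloor \frac{\lfloor (n - q)/2 \rfloor + q}{2^{p-1}} \right\rfloor, \quad p \ge 1.$$


Applying this to our problem, with $q = \lfloor Mn/2^{32} \rfloor$ where
$0 \le M < 2^{32}$, the subtraction will not overflow because

$$n - q = n - \left\lfloor \frac{Mn}{2^{32}} \right\rfloor = \left\lceil n - \frac{Mn}{2^{32}} \right\rceil = \left\lceil n\left(1 - \frac{M}{2^{32}}\right) \right\rceil,$$


so that clearly $0 \le n - q < 2^{32}$. Also, the addition will not overflow,
because

$$\left\lfloor \frac{n - q}{2} \right\rfloor + q = \left\lfloor \frac{n - q}{2} + q \right\rfloor = \left\lfloor \frac{n + q}{2} \right\rfloor,$$


and $0 \le n, q < 2^{32}$.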


Using this idea gives the following code for unsigned division by 7:


   li    M,0x24924925   Magic num, (2**35+3)/7 - 2**32.
   mulhu q,M,n          q = floor(M*n/2**32).
   sub   t,n,q          t = n - q.
   shri  t,t,1          t = (n - q)/2.
   add   t,t,q          t = (n - q)/2 + q = (n + q)/2.
   shri  q,t,2          q = (n+Mn/2**32)/8 = floor(n/7).
   muli  t,q,7          Compute remainder from
   sub   r,n,t          r = n - q*7.


For this to work, the shift amount for the hypothetical shrxi
instruction must be greater than 0. It can be shown that if d > 1 and
the multiplier $m \ge 2^{32}$ (so that the shrxi instruction is needed),
then the shift amount is greater than 0.
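The overflow-free sequence translates almost line for line into C. The sketch below assumes a 64-bit product is available to play the role of mulhu; the names `mulhu` and `divu7` are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product a*b. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n/7 without needing a 33-bit intermediate. */
uint32_t divu7(uint32_t n) {
    uint32_t q = mulhu(0x24924925u, n);   /* M = (2**35+3)/7 - 2**32 */
    assert(q <= n);                       /* so n - q cannot wrap */
    uint32_t t = (n - q) >> 1;            /* t = (n - q)/2 */
    t = t + q;                            /* t = (n + q)/2, no overflow */
    return t >> 2;                        /* total shift of 3 */
}
```

As a spot check, `divu7(0xFFFFFFFF)` returns 613,566,756, and `divu7(6)` and `divu7(7)` return 0 and 1.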


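10-9 Unsigned Division by Divisors ≥ 1


Given a word size $W \ge 1$ and a divisor $d$, $1 \le d < 2^W$, we wish to
find the least integer m and integer p such that


Equation 22

$$\left\lfloor \frac{mn}{2^p} \right\rfloor = \left\lfloor \frac{n}{d} \right\rfloor \quad \text{for } 0 \le n < 2^W,$$


with $0 \le m < 2^{W+1}$ and $p \ge W$.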


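In the unsigned case, the magic number M is given by

$$M = \begin{cases} m, & \text{if } 0 \le m < 2^W, \\ m - 2^W, & \text{if } 2^W \le m < 2^{W+1}. \end{cases}$$


Because (22) must hold for $n = d$, $\lfloor md/2^p \rfloor = 1$, or


Equation 23

$$\frac{md}{2^p} \ge 1.$$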


As in the signed case, let $n_c$ be the largest value of n such that
$\mathrm{rem}(n_c, d) = d - 1$. It can be calculated from

$$n_c = \left\lfloor \frac{2^W}{d} \right\rfloor d - 1 = 2^W - \mathrm{rem}(2^W, d) - 1.$$

Then


Equation 24a

$$2^W - d \le n_c \le 2^W - 1,$$


and


Equation 24b

$$n_c \ge d - 1.$$


These imply that $n_c \ge 2^{W-1}$.


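Because (22) must hold for $n = n_c$,

$$\left\lfloor \frac{m n_c}{2^p} \right\rfloor = \left\lfloor \frac{n_c}{d} \right\rfloor = \frac{n_c - (d - 1)}{d},$$


or

$$\frac{m n_c}{2^p} < \frac{n_c + 1}{d}.$$


Combining this with (23) gives


Equation 25

$$\frac{2^p}{d} \le m < \frac{2^p}{d} \cdot \frac{n_c + 1}{n_c}.$$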


Because m is to be the least integer satisfying (25), it is the next
integer greater than or equal to $2^p/d$; that is,


Equation 26

$$m = \frac{2^p + d - 1 - \mathrm{rem}(2^p - 1, d)}{d}.$$


Combining this with the right half of (25) and simplifying gives


Equation 27

$$2^p > n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)).$$


The Algorithm (Unsigned) 


Thus, the algorithm is to find by trial and error the least $p \ge W$
satisfying (27). Then m is calculated from (26). This is the smallest
possible value of m satisfying (22) with $p \ge W$. As in the signed case,
if (27) is true for some value of p, then it is true for all larger values of
p. The proof is essentially the same as that of Theorem DC1, except
Theorem D5(b) is used instead of Theorem D5(a).


Proof That the Algorithm Is Feasible (Unsigned) 


We must show that (27) always has a solution and that $0 \le m < 2^{W+1}$.


Because for any nonnegative integer x there is a power of 2 greater
than x and less than or equal to 2x + 1, from (27),

$$n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)) < 2^p \le 2n_c\,(d - 1 - \mathrm{rem}(2^p - 1, d)) + 1.$$

Because $0 \le \mathrm{rem}(2^p - 1, d) \le d - 1$,

Equation 28

$$1 \le 2^p \le 2n_c\,(d - 1) + 1.$$


Because $n_c, d \le 2^W - 1$, this becomes

$$1 \le 2^p \le 2(2^W - 1)(2^W - 2) + 1,$$


or


Equation 29

$$0 \le p \le 2W.$$


Thus, (27) always has a solution.


If p is not forced to equal W, then from (25) and (28),

$$\frac{1}{d} \le m < \frac{2n_c\,(d - 1) + 1}{d} \cdot \frac{n_c + 1}{n_c},$$

$$1 \le m < \frac{2d - 2 + 1/n_c}{d}\,(n_c + 1),$$

$$1 \le m < 2(n_c + 1) \le 2^{W+1}.$$


If p is forced to equal W, then from (25),

$$\frac{2^W}{d} \le m < \frac{2^W}{d} \cdot \frac{n_c + 1}{n_c}.$$


Because $1 \le d \le 2^W - 1$ and $n_c \ge 2^{W-1}$,

$$\frac{2^W}{2^W - 1} \le m < \frac{2^W}{1} \cdot \frac{2^{W-1} + 1}{2^{W-1}},$$

$$2 \le m \le 2^W + 1.$$


Hence in either case m is within limits for the code schema illustrated
by the "unsigned divide by 7" example.

Proof That the Product Is Correct (Unsigned)


We must show that if p and m are calculated from (27) and (26), then
(22) is satisfied.


Equation (26) and inequality (27) are easily seen to imply (25).
Inequality (25) is nearly the same as (4), and the remainder of the
proof is nearly identical to that for signed division with $n \ge 0$.


10-10 Incorporation into a Compiler (Unsigned) 


There is a difficulty in implementing an algorithm based on direct
evaluation of the expressions used in this proof. Although $p \le 2W$,
which is proved above, the case p = 2W can occur (e.g., for
$d = 2^W - 2$ with $W \ge 4$). When p = 2W, it is difficult to calculate m, because
the dividend in (26) does not fit in a 2W-bit word.


However, it can be implemented by the "incremental division and
remainder" technique of algorithm magic. The algorithm is given in
Figure 10-2 for W = 32. It passes back an indicator a, which tells
whether or not to generate an add instruction. (In the case of signed
division, the caller recognizes this by M and d having opposite signs.)


Figure 10-2 Computing the magic number for unsigned division.


struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned nc, delta, q1, r1, q2, r2;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   nc = -1 - (-d)%d;       // Unsigned arithmetic here.
   p = 31;                 // Init. p.
   q1 = 0x80000000/nc;     // Init. q1 = 2**p/nc.
   r1 = 0x80000000 - q1*nc;// Init. r1 = rem(2**p, nc).
   q2 = 0x7FFFFFFF/d;      // Init. q2 = (2**p - 1)/d.
   r2 = 0x7FFFFFFF - q2*d; // Init. r2 = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (r1 >= nc - r1) {
         q1 = 2*q1 + 1;            // Update q1.
         r1 = 2*r1 - nc;}          // Update r1.
      else {
         q1 = 2*q1;
         r1 = 2*r1;}
      if (r2 + 1 >= d - r2) {
         if (q2 >= 0x7FFFFFFF) magu.a = 1;
         q2 = 2*q2 + 1;            // Update q2.
         r2 = 2*r2 + 1 - d;}       // Update r2.
      else {
         if (q2 >= 0x80000000) magu.a = 1;
         q2 = 2*q2;
         r2 = 2*r2 + 1;}
      delta = d - 1 - r2;
   } while (p < 64 &&
           (q1 < delta || (q1 == delta && r1 == 0)));

   magu.M = q2 + 1;        // Magic number
   magu.s = p - 32;        // and shift amount to return
   return magu;            // (magu.a was set above).
}

Some key points in understanding this algorithm are as follows:


Unsigned overflow can occur at several places and should be
ignored.


$n_c = 2^W - \mathrm{rem}(2^W, d) - 1 = (2^W - 1) - \mathrm{rem}(2^W - d, d)$.


The quotient and remainder of dividing $2^p$ by $n_c$ cannot be
updated in the same way as is done in algorithm magic,
because here the quantity 2*r1 can overflow. Hence the
algorithm has the test "if (r1 >= nc - r1)," whereas "if
(2*r1 >= nc)" would be more natural. A similar remark
applies to computing the quotient and remainder of $2^p - 1$
divided by d.


$0 \le \delta \le d - 1$, so $\delta$ is representable as a 32-bit unsigned
integer.


$m = (2^p + d - 1 - \mathrm{rem}(2^p - 1, d))/d = \lfloor (2^p - 1)/d \rfloor + 1 = q_2 + 1$.


The subtraction of $2^W$ when the magic number M exceeds $2^W - 1$
is not explicit in the program; it occurs if the computation of
q2 overflows.


The "add" indicator, magu.a, cannot be set by a
straightforward comparison of M to $2^{32}$, or of q2 to $2^{32} - 1$,
because of overflow. Instead, the program tests q2 before
overflow can occur. If q2 ever gets as large as $2^{32} - 1$, so that
M will be greater than or equal to $2^{32}$, then magu.a is set
equal to 1. If q2 stays below $2^{32} - 1$, then magu.a is left at its
initial value of 0.


Inequality (27) is equivalent to $2^p/n_c > \delta$.


The loop test needs the condition "p < 64" because without
it, overflow of q1 would cause the program to loop too many
times, giving incorrect results.
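To show how the three outputs M, a, and s are consumed, here is a small run-time sketch for W = 32. It is an illustration only: `udivide_by_magic` is a hypothetical helper, mulhu is emulated with a 64-bit product, and the case a = 1 uses the subtract, shift, add sequence described for unsigned division by 7 instead of shrxi. A compiler would select one of the two paths when it generates code.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit unsigned product a*b. */
static uint32_t mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned n/d for a constant d, given M, the "add" indicator a, and
   the shift amount s computed for d. */
uint32_t udivide_by_magic(uint32_t n, uint32_t M, int a, int s) {
    uint32_t q = mulhu(M, n);
    if (a) {
        assert(s >= 1);                   /* needed for the s - 1 shift */
        uint32_t t = ((n - q) >> 1) + q;  /* (n + q)/2 without overflow */
        return t >> (s - 1);
    }
    return q >> s;
}
```

For example, d = 7 corresponds to M = 0x24924925, a = 1, s = 3, and d = 3 corresponds to M = 0xAAAAAAAB, a = 0, s = 1, as in the two worked examples earlier in the chapter.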


10-11 Miscellaneous Topics (Unsigned) 


Theorem DC2U. The least multiplier m is odd if p is not forced to 
equal W. 


Theorem DC3U. For a given divisor d, there is only one multiplier m 
having the minimal value of p, if p is not forced to equal W. 


The proofs of these theorems follow very closely the corresponding 
proofs for signed division. 


The Divisors with the Best Programs (Unsigned) 


For unsigned division, to find the divisors (if any) with optimal
programs of two instructions to obtain the quotient (li, mulhu), we
can do an analysis similar to that of the signed case (see "The Divisors
with the Best Programs" on page 175). The result is that such divisors
are the factors of $2^W$ or $2^W + 1$, except for d = 1. For the common
word sizes, this leaves very few nontrivial divisors that have optimal
programs for unsigned division. For W = 16, there are none. For W =
32, there are only two: 641 and 6,700,417. For W = 64, again there
are only two: 274,177 and 67,280,421,310,721.


The case $d = 2^k$, k = 1, 2, ..., deserves special mention. In this case,
algorithm magicu produces p = W (forced), $m = 2^{32-k}$. This is the
minimal value of m, but it is not the minimal value of M. Better code
results if p = W + k is used, if sufficient simplifications are done.
Then, $m = 2^W$, M = 0, a = 1, and s = k. The generated code involves
a multiplication by 0 and can be simplified to a single shift right k
instruction. As a practical matter, divisors that are a power of 2 would
probably be special-cased without using magicu. (This phenomenon
does not occur for signed division, because for signed division m
cannot be a power of 2. Proof: For d > 0, inequality (4) combined
with (3b) implies that $d - 1 < 2^p/m < d$. Therefore, $2^p/m$ cannot be
an integer. For d < 0, the result follows similarly from (16) combined
with (15b)).


For unsigned division, the code for the case $m \ge 2^W$ is considerably
worse than the code for the case $m < 2^W$, if the machine does not
have shrxi. Hence it is of interest to have some idea of how often
the large multipliers arise. For W = 32, among the integers less than
100, there are 31 "bad" divisors: 1, 7, 14, 19, 21, 27, 28, 31, 35, 37,
38, 39, 42, 45, 53, 54, 55, 56, 57, 62, 63, 70, 73, 74, 76, 78, 84, 90,
91, 95, and 97.


Using Signed in Place of Unsigned Multiply, and the Reverse


If your machine does not have mulhu, but it does have mulhs (or 
signed long multiplication), the trick given in "High-Order Product 
Signed from/to Unsigned," on page 132, might make our method of 
doing unsigned division by a constant still useful. 


That section gives a seven-instruction sequence for getting mulhu 
from mulhs. However, for this application it simplifies, because the 
magic number M is known. Thus, the compiler can test the most 
significant bit of the magic number, and generate code such as the 
following for the operation "mulhu q,M,n." Here t denotes a 
temporary register. 


   M31 = 0              M31 = 1
   mulhs q,M,n          mulhs q,M,n
   shrsi t,n,31         shrsi t,n,31
   and   t,t,M          and   t,t,M
   add   q,q,t          add   t,t,n
                        add   q,q,t


Accounting for the other instructions used with mulhu, this uses a 
total of six to eight instructions to obtain the quotient of unsigned 
division by a constant, on a machine that does not have unsigned 
multiply. 


This trick may be inverted, to get mulhs in terms of mulhu. The 
code is the same as that above except the mulhs is changed to 
mulhu and the final add in each column is changed to sub. 
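The two columns can be folded into one C sketch. The names `mulhs` and `mulhu_from_mulhs` are illustrative; the sketch assumes two's-complement conversions between `uint32_t` and `int32_t` and an arithmetic right shift of signed values, and it checks itself against a direct 64-bit unsigned product.

```c
#include <assert.h>
#include <stdint.h>

/* High-order 32 bits of the 64-bit signed product a*b. */
static int32_t mulhs(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 32);
}

/* mulhu built from mulhs, for a multiplier M known in advance. */
uint32_t mulhu_from_mulhs(uint32_t M, uint32_t n) {
    uint32_t q = (uint32_t)mulhs((int32_t)M, (int32_t)n);
    uint32_t t = (uint32_t)((int32_t)n >> 31) & M;  /* M if n31 = 1, else 0 */
    if (M >> 31)                                    /* M31 = 1 column only */
        t = t + n;
    uint32_t result = q + t;
    assert(result == (uint32_t)(((uint64_t)M * n) >> 32));
    return result;
}
```

Because M is a compile-time constant in this application, the test on its sign bit costs nothing in the generated code.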


A Simpler Algorithm (Unsigned) 


Dropping the requirement that the magic number be minimal yields a 
simpler algorithm. In place of (27) we can use 


Equation 30

$$2^p \ge 2^W\,(d - 1 - \mathrm{rem}(2^p - 1, d)),$$


and then use (26) to compute m, as before. 


It should be clear that this algorithm is formally correct (that is, that
the value of m computed does satisfy equation (22)), because its only
difference from the previous algorithm is that it computes a value of p
that, for some values of d, is unnecessarily large. It can be proved that
the value of m computed from (30) and (26) is less than $2^{W+1}$. We
omit the proof and simply give the algorithm (Figure 10-3).


Figure 10-3 Simplified algorithm for computing the magic number,
unsigned division.


struct mu {unsigned M;     // Magic number,
          int a;           // "add" indicator,
          int s;};         // and shift amount.

struct mu magicu2(unsigned d) {
                           // Must have 1 <= d <= 2**32-1.
   int p;
   unsigned p32, q, r, delta;
   struct mu magu;

   magu.a = 0;             // Initialize "add" indicator.
   p = 31;                 // Initialize p.
   q = 0x7FFFFFFF/d;       // Initialize q = (2**p - 1)/d.
   r = 0x7FFFFFFF - q*d;   // Init. r = rem(2**p - 1, d).
   do {
      p = p + 1;
      if (p == 32) p32 = 1;     // Set p32 = 2**(p-32).
      else p32 = 2*p32;
      if (r + 1 >= d - r) {
         if (q >= 0x7FFFFFFF) magu.a = 1;
         q = 2*q + 1;           // Update q.
         r = 2*r + 1 - d;       // Update r.
      }
      else {
         if (q >= 0x80000000) magu.a = 1;
         q = 2*q;
         r = 2*r + 1;
      }
      delta = d - 1 - r;
   } while (p < 64 && p32 < delta);

   magu.M = q + 1;         // Magic number and
   magu.s = p - 32;        // shift amount to return
   return magu;            // (magu.a was set above).
}


Alverson [Alv] gives a much simpler algorithm, discussed in the next
section, but it gives somewhat large values for m. The point of
algorithm magicu2 is that it nearly always gives the minimal value for
m when $d \le 2^{W-1}$. For W = 32, the smallest divisor for which
magicu2 does not give the minimal multiplier is d = 102,807, for
which magicu calculates m = 2,737,896,999 and magicu2 calculates
m = 5,475,793,997.


There is an analog of magicu2 for signed division by positive divisors, 
but it does not work out very well for signed division by arbitrary 
divisors. 


10-12 Applicability to Modulus and Floor Division 


It might seem that turning modulus or floor division by a constant into 
multiplication would be simpler, in that the "add 1 if the dividend is 
negative" step could be omitted. This is not the case. The methods 
given above do not apply in any obvious way to modulus and floor 
division. Perhaps something could be worked out; it might involve 
altering the multiplier m slightly depending upon the sign of the 
dividend. 


10-13 Similar Methods 


Rather than coding algorithm magic, we can provide a table that gives 
the magic numbers and shift amounts for a few small divisors. Divisors 
equal to the tabulated ones multiplied by a power of 2 are easily 
handled as follows: 


1. Count the number of trailing 0's in d, and let this be denoted
by k.
2. Use as the lookup argument $d/2^k$ (shift right k).
3. Use the magic number found in the table.
4. Use the shift amount found in the table, increased by k.


Thus, if the table contains the divisors 3, 5, 25, and so on, divisors of 
6, 10, 100, and so forth can be handled. 
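The four steps can be sketched as follows for signed division by positive divisors with W = 32. The table here holds only three illustrative entries (for 3, 5, and 25), and `lookup_magic` and `struct magic_entry` are hypothetical names; a real table would be larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct magic_entry { int32_t d; int32_t M; int s; };

/* A tiny illustrative table of signed magic numbers for W = 32. */
static const struct magic_entry magic_table[] = {
    {  3, 0x55555556, 0 },
    {  5, 0x66666667, 1 },
    { 25, 0x51EB851F, 3 },
};

/* Returns 1 and fills *M and *s if d is a tabulated divisor times a
   power of 2; returns 0 otherwise. */
int lookup_magic(int32_t d, int32_t *M, int *s) {
    assert(d > 0);
    int k = 0;
    while ((d & 1) == 0) {            /* step 1: count trailing 0's */
        d >>= 1;                      /* step 2: lookup argument d/2**k */
        k++;
    }
    for (size_t i = 0; i < sizeof magic_table / sizeof magic_table[0]; i++) {
        if (magic_table[i].d == d) {
            *M = magic_table[i].M;          /* step 3 */
            *s = magic_table[i].s + k;      /* step 4 */
            return 1;
        }
    }
    return 0;
}
```

With this table, a divisor of 100 yields M = 0x51EB851F with s = 5, and a divisor of 6 yields M = 0x55555556 with s = 1.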


This procedure usually gives the smallest magic number, but not
always. The smallest positive divisor for which it fails in this respect
for W = 32 is d = 334,972, for which it computes m =
3,361,176,179 and s = 18. However, the minimal magic number for d
= 334,972 is m = 840,294,045, with s = 16. The procedure also fails
to give the minimal magic number for d = -6. In both these cases,
output code quality is affected.


Alverson [Alv] is the first known to the author to state that the
method described here works with complete accuracy for all divisors.
Using our notation, his method for unsigned integer division by d is to
set the shift amount $p = W + \lceil \log_2 d \rceil$, and the multiplier
$m = \lceil 2^p/d \rceil$, and then do the division by
$n \div d = \lfloor mn/2^p \rfloor$ (that is, multiply and shift right). He proves
that the multiplier m is less than $2^{W+1}$, and that the method gets the
exact quotient for all n expressible in W bits.
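A minimal sketch of this recipe, shrunk to W = 16 so that every intermediate fits comfortably in 64-bit arithmetic. The word size, the function names, and the exhaustive checker are choices made for this illustration, not something taken from [Alv].

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 16-bit n/d using p = W + ceil(log2 d) and m = ceil(2**p / d). */
uint16_t alverson_divu16(uint16_t n, uint16_t d) {
    assert(d >= 1);
    int lg = 0;
    while ((UINT32_C(1) << lg) < d)      /* lg = ceil(log2 d) */
        lg++;
    int p = 16 + lg;                     /* shift amount, at most 32 */
    uint64_t m = ((UINT64_C(1) << p) + d - 1) / d;   /* ceil(2**p / d) */
    return (uint16_t)((m * n) >> p);     /* multiply and shift right */
}

/* Returns 1 if the result is exact for every 16-bit dividend. */
int alverson_check16(uint16_t d) {
    for (uint32_t n = 0; n <= 0xFFFFu; n++)
        if (alverson_divu16((uint16_t)n, d) != n / d)
            return 0;
    return 1;
}
```

For d = 7 this gives p = 19 and m = 74,899, and `alverson_divu16(65535, 7)` returns 9362.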


Alverson's method is a simpler variation of ours in that it doesn't
require trial and error to determine p, and is thus more suitable for
building in hardware, which is his primary interest. His multiplier m,
however, is always greater than or equal to $2^W$, and thus for the
software application always gives the code illustrated by the "divide
by 7" example (that is, always has the add and shrxi, or the
alternative four instructions). Because most small divisors can be
handled with a multiplier less than $2^W$, it seems worthwhile to look for
these cases.


For signed division, Alverson suggests finding the multiplier for |d|
and a word length of W - 1 (then 2^(W-1) ≤ m < 2^W), multiplying the
dividend by it, and negating the result if the operands have opposite
signs. (The multiplier must be such that it gives the correct result
when the dividend is 2^(W-1), the absolute value of the maximum
negative number). It seems possible that this suggestion might give
better code than what has been given here in the case that the
multiplier m ≥ 2^W. Applying it to signed division by 7 gives the
following code, where we have used the relation $-x = \bar{x} + 1$ to avoid
a branch:


   abs   an,n
   li    M,0x92492493    Magic number, (2**34+5)/7.
   mulhu q,M,an          q = floor(M*an/2**32).
   shri  q,q,2
   shrsi t,n,31          These three instructions
   xor   q,q,t           negate q if n is
   sub   q,q,t           negative.


This is not quite as good as the code we gave for signed division by 7 
(six vs. seven instructions), but it would be useful on a machine that 
has abs and mulhu but not mulhs. 
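The same sequence can be sketched in C, with a 64-bit product standing in for mulhu. The function name `divs7_alverson` is hypothetical, and the final conversion back to `int32_t` assumes the usual two's-complement wraparound.

```c
#include <stdint.h>

/* Signed n / 7, truncating toward zero, following the sequence above. */
static int32_t divs7_alverson(int32_t n) {
    uint32_t un = (uint32_t)n;
    uint32_t t  = (n < 0) ? 0xFFFFFFFFu : 0u;   /* shrsi t,n,31               */
    uint32_t an = (un ^ t) - t;                 /* abs; INT32_MIN gives 2**31 */
    uint32_t q  = (uint32_t)(((uint64_t)0x92492493u * an) >> 32);  /* mulhu   */
    q >>= 2;                                    /* shri q,q,2                 */
    q = (q ^ t) - t;                            /* negate q if n is negative  */
    return (int32_t)q;
}
```

The xor and subtract with t implement the relation -x = ~x + 1: when t is all 1-bits the pair complements q and adds 1, and when t is 0 it leaves q unchanged.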


10-14 Sample Magic Numbers 


Table 10-1. Some Magic Numbers for W = 32

[Table body not recoverable: signed and unsigned magic numbers M (hex) for sample divisors d.]


Table 10-2. Some Magic Numbers for W = 64

[Table body not recoverable: signed and unsigned magic numbers M (hex) for sample divisors d.]


10-15 Exact Division by Constants 


By "exact division," we mean division in which it is known 
beforehand, somehow, that the remainder is 0. Although this situation 
is not common, it does arise, for example, when subtracting two 
pointers in the C language. In C, the result of p - q, where p and q are 
pointers, is well defined and portable only if p and q point to objects 
in the same array [H&S, sec. 7.6.2]. If the array element size is s, the 
object code for the difference p - q computes (p - q)/s. 


The material in this section was motivated by [GM, sec. 9]. 


The method to be given applies to both signed and unsigned exact 
division, and is based on the following theorem. 


Theorem MI. If a and m are relatively prime integers, then there exists
an integer $\bar{a}$, $1 \le \bar{a} < m$, such that

$$a\bar{a} \equiv 1 \pmod{m}.$$

Thus $\bar{a}$ is a multiplicative inverse of a, modulo m. There are several
ways to prove this theorem; three proofs are given in [NZM, 52]. The
proof below requires only a very basic familiarity with congruences.


Proof. We will prove something a little more general than the 
theorem. If a and m are relatively prime (and hence nonzero), then as 
x ranges over all m distinct values modulo m, ax takes on all m 
distinct values modulo m. For example, if a = 3 and m = 8, then as x 
ranges from 0 to 7, ax = 0, 3, 6, 9, 12, 15, 18, 21 or, reduced modulo 
8, ax = 0, 3, 6, 1, 4, 7, 2, 5. Observe that all values from 0 to 7 are 
present in the last sequence. 


To see this in general, assume that it is not true. Then there exist
distinct integers that map to the same value when multiplied by a;
that is, there exist x and y, with $x \not\equiv y \pmod{m}$, such that

$$ax \equiv ay \pmod{m}.$$

But then there exists an integer k such that

$$ax - ay = km, \text{ or}$$

$$a(x - y) = km.$$

Because a has no factor in common with m, it must be that x - y is a
multiple of m; that is,

$$x \equiv y \pmod{m}.$$

This contradicts the hypothesis.


Now, because ax takes on all m distinct values modulo m, as x ranges 
over the m values, it must take on the value 1 for some x. 


The proof shows that there is only one value (modulo m) of x such
that ax ≡ 1 (mod m); that is, the multiplicative inverse is unique,
apart from additive multiples of m. It also shows that there is a unique
(modulo m) integer x such that ax ≡ b (mod m), where b is any
integer.


As an example, consider the case m = 16. Then $\bar{3} = 11$, because
3·11 = 33 ≡ 1 (mod 16). We could just as well take $\bar{3} = -5$, because
3(-5) = -15 ≡ 1 (mod 16). Similarly, $\overline{-3} = 5$, because (-3)·5 = -15
≡ 1 (mod 16).


These observations are important because they show that the concepts
apply to both signed and unsigned numbers. If we are working in the
domain of unsigned integers on a 4-bit machine, we take $\bar{3} = 11$. In
the domain of signed integers, we take $\bar{3} = -5$. But 11 and -5 have
the same representation in two's-complement (because they differ by
16), so the same computer word contents can serve in both domains
as the multiplicative inverse.


The theorem applies directly to the problem of division (signed and
unsigned) by an odd integer d on a W-bit computer. Because any odd
integer is relatively prime to 2^W, the theorem says that if d is odd,
there exists an integer $\bar{d}$ (unique in the range 0 to 2^W - 1 or in the
range -2^(W-1) to 2^(W-1) - 1) such that

$$d\bar{d} \equiv 1 \pmod{2^W}.$$

Hence for any integer n that is a multiple of d,

$$\frac{n}{d} \equiv \frac{n}{d}(d\bar{d}) \equiv n\bar{d} \pmod{2^W}.$$

In other words, n/d can be calculated by multiplying n by $\bar{d}$, and
retaining only the rightmost W bits of the product.


If the divisor d is even, let d = d_o · 2^k, where d_o is odd and k ≥ 1.
Then, simply shift n right k positions (shifting out 0's), and then
multiply by $\bar{d_o}$ (the shift could be done after the multiplication as
well).
Below is the code for division of n by 7, where n is a multiple of 7. 


This code gives the correct result whether it is considered to be signed 
or unsigned division. 


   li    M,0xB6DB6DB7    Mult. inverse, (5*2**32 + 1)/7.
   mul   q,M,n           q = n/7.
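In C the same idea is a single multiplication on a 32-bit unsigned type. The sketch below uses hypothetical names; `exact_div28u` illustrates the even-divisor case for unsigned operands only, since a signed operand would need a signed shift.

```c
#include <stdint.h>

/* n / 7 for n known to be a multiple of 7; valid for signed or unsigned n. */
static uint32_t exact_div7(uint32_t n) {
    return n * 0xB6DB6DB7u;          /* product is reduced modulo 2**32 */
}

/* n / 28 for unsigned n known to be a multiple of 28 = 7 * 2**2. */
static uint32_t exact_div28u(uint32_t n) {
    return (n >> 2) * 0xB6DB6DB7u;   /* shift out the two 0-bits, then multiply */
}
```

For a signed dividend, pass its bit pattern and reinterpret the result as signed: the pattern for -49 yields the pattern for -7.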


Computing the Multiplicative Inverse by the Euclidean 
Algorithm 


How can we compute the multiplicative inverse? The standard method 
is by means of the "extended Euclidean algorithm." This is briefly 
discussed below as it applies to our problem, and the interested reader 
is referred to [NZM, 13] and to [Knu2, sec. 4.5.2] for a more complete 
discussion. 


Given an odd divisor d, we wish to solve for x

$$dx \equiv 1 \pmod{m},$$

where, in our application, m = 2^W and W is the word size of the
machine. This will be accomplished if we can solve for integers x and
y (positive, negative, or 0) the equation

$$dx + my = 1.$$


Toward this end, first make d positive by adding a sufficient number 
of multiples of m to it. (d and d + km have the same multiplicative 
inverse.) Second, write the following equations (in which d, m > 0): 


   d(-1) + m(1) = m - d     (i)
   d( 1) + m(0) = d         (ii)


If d = 1, we are done, because (ii) shows that x = 1. Otherwise,
compute

$$q = \left\lfloor \frac{m - d}{d} \right\rfloor.$$


Third, multiply Equation (ii) by q and subtract it from (i). This gives 


   d(-1 - q) + m(1) = m - d - qd = rem(m - d, d).


This equation holds because we have simply multiplied one equation 
by a constant and subtracted it from another. If rem(m - d, d) = 1, we 
are done; this last equation is the solution and x = - 1 - q. 


Repeat this process on the last two equations, obtaining a fourth, and 
continue until the right-hand side of the equation is 1. The multiplier 
of d, reduced modulo m, is then the desired inverse of d. 


Incidentally, if m - d < d, so that the first quotient is 0, then the third
row will be a copy of the first, so that the second quotient will be
nonzero. Furthermore, most texts start with the first row being

   d(0) + m(1) = m,

but in our application m = 2^W is not representable in the machine.


The process is best illustrated by an example. Let m = 256 and d = 7.
Then the calculation proceeds as follows. To get the third row, note
that q = ⌊249/7⌋ = 35.

   7( -1) + 256( 1) = 249
   7(  1) + 256( 0) =   7
   7(-36) + 256( 1) =   4
   7( 37) + 256(-1) =   3
   7(-73) + 256( 2) =   1

Thus, the multiplicative inverse of 7, modulo 256, is -73 or, expressed
in the range 0 to 255, is 183. Check: 7·183 = 1281 ≡ 1 (mod 256).


From the third row on, the integers in the right-hand column are all 
remainders with respect to the number above it as a divisor (d being 
the dividend), so they form a sequence of strictly decreasing 
nonnegative integers. Therefore, the sequence must end in 0 (as the 
above would if carried one more step). Furthermore, the value just 
before the 0 must be 1, for the following reason. Suppose the sequence
ends in b followed by 0, with b ≠ 1. Then, the integer preceding the
b must be a multiple of b, let's say k1·b, for the next remainder to be 0.
The integer preceding k1·b must be of the form k1·k2·b + b, for the next
remainder to be b. Continuing up the sequence, every number must be
a multiple of b, including the first two (in the positions of the 249 and
the 7 in the above example). But this is impossible, because the first
two integers are m - d and d, which are relatively prime.


This constitutes an informal proof that the above process terminates, 
with a value of 1 in the right-hand column, and hence it finds the 
multiplicative inverse of d. 


To carry this out on a computer, first note that if d < 0 we should add
2^W to it. But with two's-complement arithmetic it is not necessary to
actually do anything here; simply interpret d as an unsigned number
regardless of how the application interprets it.


The computation of q must use unsigned division. 


Observe that the calculations can be done modulo m, because this
does not change the right-hand column (these values are in the range
0 to m - 1 anyway). This is important, because it enables the
calculations to be done in "single precision," using the computer's
modulo-2^W unsigned arithmetic.


Most of the quantities in the table need not be represented. The 
column of multiples of 256 need not be represented, because in 
solving dx + my = 1, we do not need the value of y. There is no need 
to represent d in the first column. Reduced to its bare essentials, then, 
the calculation of the above example is carried out as follows: 


255 249 
1 7 
220 4 
37 3 
183 1 


A C program for performing this computation is shown in Figure 10-4.
Figure 10-4 Multiplicative inverse modulo 2^32 by the Euclidean algorithm.

unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned x1, v1, x2, v2, x3, v3, q;

   x1 = 0xFFFFFFFF;     v1 = -d;        // Row (i):  d*x1 = v1 (mod 2**32).
   x2 = 1;              v2 = d;         // Row (ii): d*x2 = v2 (mod 2**32).
   while (v2 > 1) {
      q = v1/v2;                        // Unsigned division.
      x3 = x1 - q*x2;   v3 = v1 - q*v2;
      x1 = x2;          v1 = v2;
      x2 = x3;          v2 = v3;
   }
   return(x2);
}


The reason the loop continuation condition is (v2 > 1) rather than 
the more natural (v2 != 1) is that if the latter condition were used, 
the loop would never terminate if the program were invoked with an 
even argument. It is best that programs not loop forever even if 
misused. (If the argument d is even, v2 never takes on the value 1,
but it does become 0.)


What does the program compute if given an even argument? As
written, it computes a number x such that dx ≡ 0 (mod 2^32), which is
probably not useful. However, with the minor modification of
changing the loop continuation condition to (v2 != 0) and
returning x1 rather than x2, it computes a number x such that dx ≡
g (mod 2^32), where g is the greatest common divisor of d and 2^32, that
is, the greatest power of 2 that divides d. The modified program still
computes the multiplicative inverse of d for d odd, but it requires one
more iteration than the unmodified program.
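A minimal sketch of that modified program follows, written with fixed-width types. The name `mulinv_gcd` is hypothetical, and the sketch assumes d is nonzero.

```c
#include <stdint.h>

/* Returns x with d*x == g (mod 2**32), where g is the greatest power of 2
   dividing d (so x is the inverse of d when d is odd). Requires d != 0. */
static uint32_t mulinv_gcd(uint32_t d) {
    uint32_t x1 = 0xFFFFFFFFu, v1 = 0u - d;   /* d*x1 == v1 (mod 2**32) */
    uint32_t x2 = 1u,          v2 = d;        /* d*x2 == v2 (mod 2**32) */
    while (v2 != 0) {                         /* run Euclid to the end  */
        uint32_t q  = v1 / v2;
        uint32_t x3 = x1 - q * x2;
        uint32_t v3 = v1 - q * v2;
        x1 = x2; v1 = v2;
        x2 = x3; v2 = v3;
    }
    return x1;                                /* v1 now holds g */
}
```

For example, `mulinv_gcd(6)` returns 715827883, and 6 · 715827883 = 2^32 + 2, which is 2 modulo 2^32.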


As for the number of iterations (divisions) required by the above 
program, for d odd and less than 20, it requires a maximum of 3 and 
an average of 1.7. For d in the neighborhood of 1000, it requires a 
maximum of 11 and an average of about 6. 


Computing the Multiplicative Inverse by Newton's Method 


It is well known that, over the real numbers, 1/d, for d ≠ 0, can be
calculated to ever increasing accuracy by iteratively evaluating

$$x_{n+1} = x_n(2 - dx_n), \qquad (31)$$

provided the initial estimate x_0 is sufficiently close to 1/d. The
number of digits of accuracy approximately doubles with each
iteration.


It is not so well known that this same formula can be used to find the 
multiplicative inverse in the domain of modular arithmetic on 
integers! For example, to find the multiplicative inverse of 3, modulo 
256, start with x_0 = 1 (any odd number will do). Then,

   x_1 = 1(2 - 3·1) = -1,
   x_2 = -1(2 - 3(-1)) = -5,
   x_3 = -5(2 - 3(-5)) = -85,
   x_4 = -85(2 - 3(-85)) = -21845 ≡ -85 (mod 256).


The iteration has reached a fixed point modulo 256, so -85, or 171, is 
the multiplicative inverse of 3 (modulo 256). All calculations can be 
done modulo 256. 


Why does this work? Because if x_n satisfies

$$dx_n \equiv 1 \pmod{m}$$

and if x_{n+1} is defined by (31), then

$$dx_{n+1} \equiv 1 \pmod{m^2}.$$

To see this, let dx_n = 1 + km. Then

$$\begin{aligned}
dx_{n+1} &= dx_n(2 - dx_n) \\
         &= (1 + km)(2 - (1 + km)) \\
         &= (1 + km)(1 - km) \\
         &= 1 - k^2m^2 \\
         &\equiv 1 \pmod{m^2}.
\end{aligned}$$


In our application, m is a power of 2, say 2^N. In this case, if
dx_n ≡ 1 (mod 2^N), then

$$dx_{n+1} \equiv 1 \pmod{2^{2N}}.$$

In a sense, if x_n is regarded as a sort of approximation to $\bar{d}$, then each
iteration of (31) doubles the number of bits of "accuracy" of the
approximation.


It happens that, modulo 8, the multiplicative inverse of any (odd)
number d is d itself. Thus, taking x_0 = d is a reasonable and simple
initial guess at $\bar{d}$. Then, (31) will give values of x_1, x_2, ..., such that

$$dx_1 \equiv 1 \pmod{2^6},$$
$$dx_2 \equiv 1 \pmod{2^{12}},$$
$$dx_3 \equiv 1 \pmod{2^{24}},$$
$$dx_4 \equiv 1 \pmod{2^{48}}, \text{ and so on.}$$

Thus, four iterations suffice to find the multiplicative inverse modulo
2^32 (if x ≡ 1 (mod 2^48) then x ≡ 1 (mod 2^n) for n ≤ 48). This leads
to the C program in Figure 10-5, in which all computations are done
modulo 2^32.
Figure 10-5 Multiplicative inverse modulo 2^32 by Newton's method.

unsigned mulinv(unsigned d) {           // d must be odd.
   unsigned xn, t;

   xn = d;                              // Initial guess, correct to 3 bits.
loop: t = d*xn;
   if (t == 1) return xn;
   xn = xn*(2 - t);
   goto loop;
}
For about half the values of d, this program takes 4 1/2 iterations, or
nine multiplications. For the other half (those for which the initial
value of xn is "correct to 4 bits," that is, d^2 ≡ 1 (mod 16)), it takes
seven or fewer, usually seven, multiplications. Thus, it takes about
eight multiplications on average.


A variation is to simply execute the loop four times, regardless of d,
perhaps "strung out" to eliminate the loop control (eight
multiplications). Another variation is to somehow make the initial
estimate x_0 "correct to 4 bits" (that is, find x_0 that satisfies dx_0 ≡ 1
(mod 16)). Then, only three loop iterations are required. Some ways
to set the initial estimate are

   x_0 ← d + 2((d + 1) & 4), and
   x_0 ← d^2 + d - 1.

Here, the multiplication by 2 is a left shift, and the computations are
done modulo 2^32 (ignoring overflow). Because the second formula
uses a multiplication, it saves only one.
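The three-iteration variation with the first initial estimate might be coded as below. This is a sketch with a hypothetical function name, not one of the book's figures. Each step doubles the number of correct low-order bits, from 4 to 8 to 16 to 32.

```c
#include <stdint.h>

/* Multiplicative inverse of odd d modulo 2**32, using three Newton steps. */
static uint32_t mulinv_newton3(uint32_t d) {
    uint32_t x = d + 2u * ((d + 1u) & 4u);   /* d*x == 1 (mod 2**4)  */
    x = x * (2u - d * x);                    /* d*x == 1 (mod 2**8)  */
    x = x * (2u - d * x);                    /* d*x == 1 (mod 2**16) */
    x = x * (2u - d * x);                    /* d*x == 1 (mod 2**32) */
    return x;
}
```

This costs six multiplications and has no loop or test, at the price of always doing the full three steps.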


This concern about execution time is of course totally unimportant for 
the compiler application. For that application, the routine would be so 
seldom used that it should be coded for minimum space. But there 
may be applications in which it is desirable to compute the 
multiplicative inverse quickly. 


Sample Multiplicative Inverses 


We conclude this section with a listing of some multiplicative inverses 
in Table 10-3. 


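Table 10-3. Sample Multiplicative Inverses

     d            Inverse of d (hex)
   (dec)      32-bit            64-bit
    -7       49249249     9249249249249249
    -5       33333333     3333333333333333
    -3       55555555     5555555555555555
    -1       FFFFFFFF     FFFFFFFFFFFFFFFF
     1              1                    1
     3       AAAAAAAB     AAAAAAAAAAAAAAAB
     5       CCCCCCCD     CCCCCCCCCCCCCCCD
     7       B6DB6DB7     6DB6DB6DB6DB6DB7
     9       38E38E39     8E38E38E38E38E39
    11       BA2E8BA3     2E8BA2E8BA2E8BA3
    13       C4EC4EC5     4EC4EC4EC4EC4EC5
    15       EEEEEEEF     EEEEEEEEEEEEEEEF
    25       C28F5C29     8F5C28F5C28F5C29
   125       26E978D5     1CAC083126E978D5
   625       3AFB7E91     D288CE703AFB7E91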


You may notice that in several cases (d = 3, 5, 9, 11), the
multiplicative inverse of d is the same as the magic number for
unsigned division by d (see Section 10-14, "Sample Magic Numbers,"
on page 189). This is more or less a coincidence. It happens that for
these numbers, the magic number M is equal to the multiplier m, and
these are of the form (2^p + 1)/d, with p ≥ 32. In this case, notice
that

$$Md = \left(\frac{2^p + 1}{d}\right)d \equiv 1 \pmod{2^{32}},$$

so that $M \equiv \bar{d} \pmod{2^{32}}$.


10-16 Test for Zero Remainder after Division by a Constant 


The multiplicative inverse of a divisor d can be used to test for a zero 
remainder after division by d [GM]. 


Unsigned 


First, consider unsigned division with the divisor d odd. Denote by $\bar{d}$
the multiplicative inverse of d. Then, because $d\bar{d} \equiv 1 \pmod{2^W}$,
where W is the machine's word size in bits, $\bar{d}$ is also odd. Thus, $\bar{d}$ is
relatively prime to 2^W, and as shown in the proof of theorem MI in the
preceding section, as n ranges over all 2^W distinct values modulo 2^W,
$n\bar{d}$ takes on all 2^W distinct values modulo 2^W.

It was shown in the preceding section that if n is a multiple of d,

$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$

That is, for $n = 0, d, 2d, \ldots, \lfloor (2^W - 1)/d \rfloor d$, we have
$n\bar{d} \equiv 0, 1, 2, \ldots, \lfloor (2^W - 1)/d \rfloor \pmod{2^W}$.
Therefore, for n not a multiple of d, the value of $n\bar{d}$, reduced modulo
2^W to the range 0 to 2^W - 1, must exceed $\lfloor (2^W - 1)/d \rfloor$.

This can be used to test for a zero remainder. For example, to test if an
integer n is a multiple of 25, multiply n by $\overline{25}$ and compare the
rightmost W bits to $\lfloor (2^W - 1)/25 \rfloor$. On our basic RISC:


   li     M,0xC28F5C29    Load mult. inverse of 25.
   mul    q,M,n           q = right half of M*n.
   li     c,0x0A3D70A3    c = floor((2**32-1)/25).
   cmpleu t,q,c           Compare q and c, and branch
   bt     t,is_mult       if n is a multiple of 25.
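The same test is two operations in C. The function name `is_mult25` is made up for this sketch; the two constants are the ones in the listing above.

```c
#include <stdint.h>

/* Nonzero if unsigned n is a multiple of 25. */
static int is_mult25(uint32_t n) {
    uint32_t q = n * 0xC28F5C29u;    /* low 32 bits of n times the inverse of 25 */
    return q <= 0x0A3D70A3u;         /* floor((2**32 - 1)/25)                    */
}
```

No division or remainder instruction is executed, which is the point of the method when d is a compile-time constant.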


To extend this to even divisors, let d = d_o · 2^k, where d_o is odd and k
≥ 1. Then, because an integer is divisible by d if and only if it is
divisible by d_o and by 2^k, and because n and $n\bar{d_o}$ have the same
number of trailing zeros ($\bar{d_o}$ is odd), the test that n is a multiple of d
is

   Set $q = \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $q \le \lfloor (2^W - 1)/d_o \rfloor$ and q ends in k or more 0-bits,

where the "mod" function is understood to reduce $n\bar{d_o}$ to the interval
[0, 2^W - 1].


Direct implementation of this requires two tests and conditional
branches, but it can be reduced to one compare-branch quite
efficiently if the machine has the rotate-shift instruction. This follows
from the following theorem, in which $a \overset{\mathrm{rot}}{\gg} k$ denotes the computer
word a rotated right k positions (0 ≤ k ≤ 32).

THEOREM ZRU. x ≤ a and x ends in k 0-bits if and only if
$x \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$.


Proof. (Assume a 32-bit machine.) Suppose x ≤ a and x ends in k 0-bits.
Then, because x ≤ a, $\lfloor x/2^k \rfloor \le \lfloor a/2^k \rfloor$. But
$\lfloor x/2^k \rfloor = x \overset{\mathrm{rot}}{\gg} k$. Therefore,
$x \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$. If x does not
end in k 0-bits, then $x \overset{\mathrm{rot}}{\gg} k$ does not begin with k 0-bits, whereas
$\lfloor a/2^k \rfloor$ does, so $x \overset{\mathrm{rot}}{\gg} k > \lfloor a/2^k \rfloor$. Lastly, if x > a and x
ends in k 0-bits, then the integer formed from the first 32 - k bits of x
must exceed that formed from the first 32 - k bits of a, so that
$\lfloor x/2^k \rfloor > \lfloor a/2^k \rfloor$.
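Using this theorem, the test that n is a multiple of d, where n and d
≥ 1 are unsigned integers and d = d_o · 2^k, with d_o odd, is

   $q \leftarrow \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $q \overset{\mathrm{rot}}{\gg} k \le \lfloor (2^W - 1)/d \rfloor$.

Here we used

$$\left\lfloor \lfloor (2^W - 1)/d_o \rfloor / 2^k \right\rfloor = \lfloor (2^W - 1)/(d_o \cdot 2^k) \rfloor = \lfloor (2^W - 1)/d \rfloor.$$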


As an example, the following code tests an unsigned integer n to see if 
it is a multiple of 100: 


   li     M,0xC28F5C29    Load mult. inverse of 25.
   mul    q,M,n           q = right half of M*n.
   shrri  q,q,2           Rotate right two positions.
   li     c,0x028F5C28    c = floor((2**32-1)/100).
   cmpleu t,q,c           Compare q and c, and branch
   bt     t,is_mult       if n is a multiple of 100.
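C has no rotate operator, so the sketch below spells one out with two shifts, which compilers commonly recognize as a rotate. The helper `rotr32` and the name `is_mult100` are assumptions of this example.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by k positions, 0 <= k <= 31. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << ((32 - k) & 31));
}

/* Nonzero if unsigned n is a multiple of 100 = 25 * 2**2. */
static int is_mult100(uint32_t n) {
    uint32_t q = n * 0xC28F5C29u;         /* n times the inverse of 25 */
    return rotr32(q, 2) <= 0x028F5C28u;   /* floor((2**32 - 1)/100)    */
}
```

If either of the two low-order bits of q is 1, the rotate moves it to the top of the word and the unsigned comparison fails, so one compare covers both conditions of Theorem ZRU.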


Signed, Divisor ≥ 2


For signed division, it was shown in the preceding section that if n is a
multiple of d, and d is odd, then

$$\frac{n}{d} \equiv n\bar{d} \pmod{2^W}.$$

Thus, for

$$n = \lceil -2^{W-1}/d \rceil \cdot d, \ldots, -d, 0, d, \ldots, \lfloor (2^{W-1} - 1)/d \rfloor \cdot d,$$

we have

$$n\bar{d} \equiv \lceil -2^{W-1}/d \rceil, \ldots, -1, 0, 1, \ldots, \lfloor (2^{W-1} - 1)/d \rfloor \pmod{2^W}.$$

Furthermore, because $\bar{d}$ is relatively prime to 2^W, as n ranges over all
2^W distinct values modulo 2^W, $n\bar{d}$ takes on all 2^W distinct values
modulo 2^W. Therefore, n is a multiple of d if and only if

$$\lceil -2^{W-1}/d \rceil \le \mathrm{mod}(n\bar{d}, 2^W) \le \lfloor (2^{W-1} - 1)/d \rfloor,$$

where the "mod" function is understood to reduce $n\bar{d}$ to the interval
[-2^(W-1), 2^(W-1) - 1].


This can be simplified a little by observing that because d is odd and,
as we are assuming, positive and not equal to 1, it does not divide
2^(W-1). Therefore,

$$\lceil -2^{W-1}/d \rceil = \lceil (-2^{W-1} + 1)/d \rceil = -\lfloor (2^{W-1} - 1)/d \rfloor.$$


Thus, for signed numbers, the test that n is a multiple of d, where d =
d_o · 2^k and d_o is odd, is

   Set $q = \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $-\lfloor (2^{W-1} - 1)/d_o \rfloor \le q \le \lfloor (2^{W-1} - 1)/d_o \rfloor$ and q ends in k or more 0-bits.


On the surface, this would seem to require three tests and branches. 
However, as in the unsigned case, it can be reduced to one compare- 
branch by use of the following theorem. 


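Theorem ZRS. If a ≥ 0, the following assertions are equivalent:

   (1) -a ≤ x ≤ a and x ends in k or more 0-bits,
   (2) $\mathrm{abs}(x) \overset{\mathrm{rot}}{\gg} k \le \lfloor a/2^k \rfloor$ (unsigned comparison), and
   (3) $(x + a') \overset{\mathrm{rot}}{\gg} k \le \lfloor 2a'/2^k \rfloor$ (unsigned comparison),

where a' is a with its rightmost k bits set to 0 (that is, a' = a & -2^k).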


Proof. (Assume a 32-bit machine). To see that (1) is equivalent to (2),
clearly the assertion -a ≤ x ≤ a is equivalent to abs(x) ≤ a. Then,
Theorem ZRU applies, because both sides of this inequality are
nonnegative.

To see that (1) is equivalent to (3), note that assertion (1) is
equivalent to itself with a replaced with a'. Then, by the theorem on
bounds checking on page 52, this in turn is equivalent to

   x + a' ≤ 2a' (unsigned comparison).

Because x + a' ends in k 0-bits if and only if x does, Theorem ZRU
applies, giving the result.


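Using part (3) of this theorem, the test that n is a multiple of d, where
n and d ≥ 2 are signed integers and d = d_o · 2^k with d_o odd, is

   $q \leftarrow \mathrm{mod}(n\bar{d_o}, 2^W)$;
   $a' \leftarrow \lfloor (2^{W-1} - 1)/d_o \rfloor \;\&\; -2^k$;
   $(q + a') \overset{\mathrm{rot}}{\gg} k \le \lfloor (2a')/2^k \rfloor$ (unsigned comparison).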


(a' may be computed at compile time, because d is a constant.) 

As an example, the following code tests a signed integer n to see if it is
a multiple of 100. Notice that the constant $\lfloor 2a'/2^k \rfloor$ can always be
derived from the constant a' by a shift of k - 1 bits, saving an
instruction or a load from memory to develop the comparand.


   li     M,0xC28F5C29    Load mult. inverse of 25.
   mul    q,M,n           q = right half of M*n.
   li     c,0x051EB850    c = floor((2**31 - 1)/25) & -4.
   add    q,q,c           Add c.
   shrri  q,q,2           Rotate right two positions.
   shri   c,c,1           Compute const. for comparison.
   cmpleu t,q,c           Compare q and c, and
   bt     t,is_mult       branch if n is a mult. of 100.
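A C rendering of the signed test, under the same assumptions as the listing (32-bit words, d = 100 = 25 · 2^2), is sketched below. The names `rotr32` and `is_mult100s` are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by k positions, 0 <= k <= 31. */
static uint32_t rotr32(uint32_t x, unsigned k) {
    return (x >> k) | (x << ((32 - k) & 31));
}

/* Nonzero if signed n is a multiple of 100. */
static int is_mult100s(int32_t n) {
    uint32_t c = 0x051EB850u;                    /* a' = floor((2**31 - 1)/25) & -4 */
    uint32_t q = (uint32_t)n * 0xC28F5C29u + c;  /* n times inverse of 25, plus a'  */
    return rotr32(q, 2) <= (c >> 1);             /* compare with floor(2a'/2**2)    */
}
```

Adding a' shifts the symmetric range [-a', a'] to [0, 2a'], so a single unsigned comparison after the rotate handles both signs.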


I think that I shall never envision 

An op unlovely as division. 

An op whose answer must be guessed 

And then, through multiply, assessed; 

An op for which we dearly pay, 

In cycles wasted every day. 

Division code is often hairy; 

Long division's downright scary. 

The proofs can overtax your brain, 

The ceiling and floor may drive you insane. 
Good code to divide takes a Knuthian hero, 


But even God can't divide by zero! 


Chapter 11. Some Elementary Functions 


Integer Square Root 
Integer Cube Root 
Integer Exponentiation 


Integer Logarithm 


11-1 Integer Square Root 


By the "integer square root" function, we mean the function $\lfloor \sqrt{x} \rfloor$.
To extend its range of application and to avoid deciding what to do
with a negative argument, we assume x is unsigned. Thus, 0 ≤ x ≤
2^32 - 1.


Newton's Method 


For floating-point numbers, the square root is almost universally
computed by Newton's method. This method begins by somehow
obtaining a starting estimate g_0 of $\sqrt{a}$. Then, a series of more
accurate estimates is obtained from

$$g_{n+1} = \left(g_n + \frac{a}{g_n}\right)/2.$$

The iteration converges quadratically; that is, if at some point g_n is
accurate to n bits, then g_{n+1} is accurate to 2n bits. The program must
have some means of knowing when it has iterated enough, so it can
terminate.


It is a pleasant surprise that Newton's method works fine in the 
domain of integers. To see this, we need the following theorem: 


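THEOREM. Let $g_{n+1} = \lfloor (g_n + \lfloor a/g_n \rfloor)/2 \rfloor$, with g_n, a integers greater
than 0. Then

   (a) if $g_n > \lfloor \sqrt{a} \rfloor$ then $\lfloor \sqrt{a} \rfloor \le g_{n+1} < g_n$, and

   (b) if $g_n = \lfloor \sqrt{a} \rfloor$ then $\lfloor \sqrt{a} \rfloor \le g_{n+1} \le \lfloor \sqrt{a} \rfloor + 1$.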


That is, if we have an integral guess g_n to $\lfloor \sqrt{a} \rfloor$ that is too high, then
the next guess g_{n+1} will be strictly less than the preceding one, but
not less than $\lfloor \sqrt{a} \rfloor$. Therefore, if we start with a guess that's too
high, the sequence converges monotonically. If the guess
$g_n = \lfloor \sqrt{a} \rfloor$, then the next guess is either equal to g_n or is 1 larger.
This provides an easy way to determine when the sequence has
converged: If we start with $g_0 \ge \lfloor \sqrt{a} \rfloor$, convergence has occurred
when g_{n+1} ≥ g_n, and then the result is precisely g_n.


The case a = 0, however, must be treated specially, because this 
procedure would lead to dividing 0 by 0. 


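Proof. (a) Because g_n is an integer,

$$g_{n+1} = \left\lfloor \frac{g_n + \lfloor a/g_n \rfloor}{2} \right\rfloor
         = \left\lfloor \frac{g_n + a/g_n}{2} \right\rfloor
         = \left\lfloor \frac{g_n^2 + a}{2g_n} \right\rfloor.$$

Because $g_n > \lfloor \sqrt{a} \rfloor$ and g_n is an integer, $g_n > \sqrt{a}$. Define ε by
$g_n = (1 + \varepsilon)\sqrt{a}$. Then ε > 0 and

$$g_{n+1} = \left\lfloor \frac{g_n^2 + a}{2g_n} \right\rfloor
         = \left\lfloor \frac{(1 + \varepsilon)^2 a + a}{2(1 + \varepsilon)\sqrt{a}} \right\rfloor
         = \left\lfloor \frac{2 + 2\varepsilon + \varepsilon^2}{2(1 + \varepsilon)} \sqrt{a} \right\rfloor.$$

Since $2 + 2\varepsilon \le 2 + 2\varepsilon + \varepsilon^2 < 2(1 + \varepsilon)^2$,

$$\sqrt{a} \le \frac{2 + 2\varepsilon + \varepsilon^2}{2(1 + \varepsilon)} \sqrt{a} < (1 + \varepsilon)\sqrt{a} = g_n,$$

and, taking floors (g_n is an integer),

$$\lfloor \sqrt{a} \rfloor \le g_{n+1} < g_n.$$

(b) Because $g_n = \lfloor \sqrt{a} \rfloor$, $\sqrt{a} - 1 < g_n \le \sqrt{a}$, so that
$g_n^2 \le a < (g_n + 1)^2$. Hence we have

$$\left\lfloor \frac{g_n^2 + g_n^2}{2g_n} \right\rfloor \le g_{n+1} \le \left\lfloor \frac{g_n^2 + (g_n + 1)^2}{2g_n} \right\rfloor,$$

$$\lfloor g_n \rfloor \le g_{n+1} \le \left\lfloor g_n + 1 + \frac{1}{2g_n} \right\rfloor,$$

$$g_n \le g_{n+1} \le g_n + 1 \quad \text{(because } g_n \text{ is an integer and } \tfrac{1}{2g_n} < 1\text{)},$$

$$\lfloor \sqrt{a} \rfloor \le g_{n+1} \le \lfloor \sqrt{a} \rfloor + 1.$$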


The difficult part of using Newton's method to calculate $\lfloor \sqrt{x} \rfloor$ is
getting the first guess. The procedure of Figure 11-1 sets the first guess
g_0 equal to the least power of 2 that is greater than or equal to $\sqrt{x}$.
For example, for x = 4, g_0 = 2, and for x = 5, g_0 = 4.


Figure 11-1 Integer square root, Newton's method.

int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}


Because the first guess g_0 is a power of 2, it is not necessary to do a
real division to get g_1; instead, a shift right suffices.


Because the first guess is accurate to about 1 bit, and Newton's 
method converges quadratically (the number of bits of accuracy 
doubles with each iteration), one would expect the procedure to 
converge within about five iterations (on a 32-bit machine), which 
requires four divisions (because the first iteration substitutes a shift 
right). An exhaustive experiment reveals that the maximum number of 
divisions is five, or four for arguments up to 16,785,407. 


If number of leading zeros is available, then getting the first guess is 
very simple: Replace the first seven executable lines in the procedure 
above with 


if (x <= 1) return x; 
s = 16 - nlz(x - 1)/2; 
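Put together, the nlz-based variant might look like the sketch below. The loop-based `nlz` is only a portable stand-in for a hardware count-leading-zeros instruction, and the name `isqrt_newton` is hypothetical.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word (32 when x == 0). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* floor(sqrt(x)) by Newton's method, with the first guess taken from nlz. */
static uint32_t isqrt_newton(uint32_t x) {
    uint32_t g0, g1;
    int s;
    if (x <= 1) return x;
    s  = 16 - nlz(x - 1) / 2;
    g0 = (uint32_t)1 << s;          /* least power of 2 >= sqrt(x)        */
    g1 = (g0 + (x >> s)) >> 1;      /* (g0 + x/g0)/2, done with a shift   */
    while (g1 < g0) {               /* iterate while strictly decreasing  */
        g0 = g1;
        g1 = (g0 + x / g0) >> 1;
    }
    return g0;
}
```

For x = 5 this gives s = 2 and a first guess of 4, matching the example given for Figure 11-1.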


Another alternative, if number of leading zeros is not available, is to
compute s by means of a binary search tree. This method permits
getting a slightly better value of g_0: the least power of 2 that is greater
than or equal to $\lfloor \sqrt{x} \rfloor$. For some values of x, this gives a smaller
value of g_0, but a value large enough so that the convergence criterion
of the theorem still holds. The difference in these schemes is
illustrated below.


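   Range of x for            Range of x for                      First Guess
   Figure 11-1               Figure 11-2                         g_0
   1                         1 to 3                              1
   2 to 4                    4 to 8                              2
   5 to 16                   9 to 24                             4
   17 to 64                  25 to 80                            8
   65 to 256                 81 to 288                           16
   ...                       ...                                 ...
   2^28 + 1 to 2^30          (2^14 + 1)^2 to (2^15 + 1)^2 - 1    2^15
   2^30 + 1 to 2^32 - 1      (2^15 + 1)^2 to 2^32 - 1            2^16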


This procedure is shown in Figure 11-2. It is convenient there to treat
small values of x (0 ≤ x ≤ 24) specially, so that no divisions are done
for them.


Figure 11-2 Integer square root, binary search for first guess.

int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;
   g0 = 1 << s;                // g0 = 2**s.

   // Continue as in Figure 11-1.


The worst-case execution time of the algorithm of Figure 11-1, on the 
basic RISC, is about 26 + (D + 6)n cycles, where D is the divide time 
in cycles and n is the number of times the while-loop is executed. The 
worst-case execution time of Figure 11-2 is about 27 + (D + 6)n 
cycles, assuming (in both cases) that the branch instructions take one 
cycle. The table below gives the average number of times the loop is 
executed by the two algorithms, for x uniformly distributed in the 
indicated range. 


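[Table not recoverable: average number of while-loop iterations for Figure 11-1 and Figure 11-2, for x uniformly distributed in the ranges 0 to 9, 0 to 99, 0 to 999, 0 to 9999, and 0 to 2^32 - 1.]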


If we assume a divide time of 20 cycles and x ranging uniformly from 
0 to 9999, then both algorithms execute in about 81 cycles. 


Binary Search 


Because the algorithms based on Newton's method start out with a 
sort of binary search to obtain the first guess, why not do the whole 
computation with a binary search? This method would start out with 
two bounds, perhaps initialized to 0 and 2^16. It would make a guess at 
the midpoint of the bounds. If the square of the midpoint is greater 
than the argument x, then the upper bound is changed to be equal to 
the midpoint. If the square of the midpoint is less than the argument 
x, then the lower bound is changed to be equal to the midpoint. The 
process ends when the upper and lower bounds differ by 1, and the 
result is the lower bound. 


This avoids division, but requires quite a few multiplications: 16 if 0
and 2^16 are used as the initial bounds. (The method gets one more bit
of precision with each iteration.) Figure 11-3 illustrates a variation of
this procedure, which uses initial values for the bounds that are slight
improvements over 0 and 2^16. The procedure shown in Figure 11-3
also saves a cycle in the loop, for most RISC machines, by altering a
and b in such a way that the comparison is b ≥ a rather than b - a ≥ 1.


Figure 11-3 Integer square root, simple binary search. 


int isqrt(unsigned x) {
   unsigned a, b, m;            // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;            // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}


The predicates that must be maintained at the beginning of each
iteration are $a \le \lfloor \sqrt{x} \rfloor + 1$ and $b \ge \lfloor \sqrt{x} \rfloor$. The initial value of b
should be something that's easy to compute and close to $\lfloor \sqrt{x} \rfloor$.
Reasonable initial values are x, x ÷ 4 + 1, x ÷ 8 + 2, x ÷ 16 + 4, x
÷ 32 + 8, x ÷ 64 + 16, and so on. Expressions near the beginning of
this list are better initial bounds for small x, and those near the end
are better for larger x. (The value x ÷ 2 + 1 is acceptable, but
probably not useful, because x ÷ 4 + 1 is everywhere a better or
equal bound.)


Seven variations on the procedure shown in Figure 11-3 can be more
or less mechanically generated by substituting a + 1 for a, or b - 1 for
b, or by changing m = (a + b) ÷ 2 to m = (a + b + 1) ÷ 2, or
some combination of these substitutions.


The execution time of the procedure shown in Figure 11-3 is about 6 
+ (M + 7.5)n, where M is the multiplication time in cycles and n is 
the number of times the loop is executed. The table below gives the 
average number of times the loop is executed, for x uniformly 
distributed in the indicated range. 


[Table not recoverable: average number of loop iterations for Figure 11-3, for x uniformly distributed in the ranges 0 to 9, 0 to 99, 0 to 999, 0 to 9999, and 0 to 2^32 - 1.]


If we assume a multiplication time of 5 cycles and x ranging uniformly 
from 0 to 9999, the algorithm runs in about 94 cycles. The maximum 
execution time (n = 16) is about 206 cycles. 


If number of leading zeros is available, the initial bounds can be set
from

   b = (1 << (33 - nlz(x))/2) - 1;
   a = (b + 3)/2;

That is, $b = 2^{\lfloor (33 - \mathrm{nlz}(x))/2 \rfloor} - 1$. These are very good bounds for
small values of x (one loop iteration for 0 ≤ x ≤ 15), but only a
moderate improvement, for large x, over the bounds calculated in
Figure 11-3. For x in the range 0 to 9999, the average number of
iterations is about 5.45, which gives an execution time of about 74
cycles, using the same assumptions as above.
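A complete routine using these bounds could be sketched as follows. The loop-based `nlz` is a portable stand-in for a hardware instruction, and `isqrt_bsearch` is a hypothetical name. Since b never exceeds 65535, the product m*m cannot overflow 32 bits.

```c
#include <stdint.h>

/* Number of leading zeros of a 32-bit word (32 when x == 0). */
static int nlz(uint32_t x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) { x <<= 1; n++; }
    return n;
}

/* floor(sqrt(x)) by binary search, with initial bounds taken from nlz. */
static uint32_t isqrt_bsearch(uint32_t x) {
    uint32_t b = ((uint32_t)1 << ((33 - nlz(x)) / 2)) - 1;  /* b >= floor(sqrt(x))     */
    uint32_t a = (b + 3) / 2;                               /* a <= floor(sqrt(x)) + 1 */
    uint32_t m;
    do {
        m = (a + b) >> 1;
        if (m * m > x) b = m - 1;
        else           a = m + 1;
    } while (b >= a);
    return a - 1;
}
```

For x = 16 the bounds start at a = 5 and b = 7, and the loop finishes in two iterations with the result 4.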


A Hardware Algorithm 
There is a shift-and-subtract algorithm for computing the square root
that is quite similar to the hardware division algorithm described in
Figure 9-2 on page 149. Embodied in hardware on a 32-bit machine,
this algorithm employs a 64-bit register that is initialized to 32 0-bits
followed by the argument x. On each iteration, the 64-bit register is
shifted left two positions, and the current result y (initially 0) is
shifted left one position. Then 2y + 1 is subtracted from the left half
of the 64-bit register. If the result of the subtraction is nonnegative, it
replaces the left half of the 64-bit register, and 1 is added to y (this
does not require an adder, because y ends in 0 at this point). If the
result of the subtraction is negative, then the 64-bit register and y are
left unaltered. The iteration is done 16 times.
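

The register-level procedure just described can be simulated directly in C. The following is only a sketch under our own assumptions: a 64-bit integer stands in for the hardware register, and the name isqrt_hw is hypothetical, not taken from any figure.

```c
#include <stdint.h>

/* Simulates the shift-and-subtract square root. reg models the 64-bit
   register (32 zero bits followed by x); y accumulates the root. */
uint32_t isqrt_hw(uint32_t x) {
    uint64_t reg = x;                 /* Left half 0, right half x.     */
    uint32_t y = 0;
    for (int i = 0; i < 16; i++) {
        reg <<= 2;                    /* Shift register left two.       */
        y <<= 1;                      /* Shift result left one.         */
        uint32_t hi = (uint32_t)(reg >> 32);   /* Left half.            */
        uint32_t t = 2*y + 1;         /* Quantity to subtract.          */
        if (hi >= t) {                /* Difference is nonnegative.     */
            reg -= (uint64_t)t << 32; /* Replace the left half.         */
            y |= 1;                   /* y ends in 0, so "or" suffices. */
        }
    }
    return y;                         /* Remainder is in reg >> 32.     */
}
```

For x = 179 the sketch returns 13, leaving a remainder of 10 in the left half of the register.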


This algorithm was described in 1945 [JVN]. 


Perhaps surprisingly, this process runs in about half the time of that of
the $64 \div 32 \Rightarrow 32$ hardware division algorithm cited, because it
does half as many iterations and each iteration is about equally
complex in the two algorithms.


To code this algorithm in software, it is probably best to avoid the use 
of a doubleword shift register, which requires about four instructions 
to shift. The algorithm in Figure 11-4 [GLS1] accomplishes this by 
shifting y and a mask bit m to the right. It executes in about 149 basic 
RISC instructions (average). The two expressions y | m could also be
y + m.


Figure 11-4 Integer square root, hardware algorithm. 


int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while(m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}


The operation of this algorithm is similar to the grade-school method.
It is illustrated below, for finding $\lfloor \sqrt{179} \rfloor$ on an 8-bit machine.

[Illustration not recoverable from the source. Its rows trace each shift-and-subtract step; the steps in which b exceeds x are marked "(Can't subtract)."]


The result is 13 with a remainder of 10 left in register x. 


It is possible to eliminate the if x >= b test by the usual trickery
involving shift right signed 31. It can be proved that the high-order bit
of b is always zero (in fact, $b \le 5 \cdot 2^{28}$), which simplifies the x >= b
predicate (see page 22). The result is that the if statement group can
be replaced with


t = (int) (x | ~(x - b)) >> 31; // -1 if x >= b, else 0.
x = x - (b & t);
y = y | (m & t);


This replaces an average of three cycles with seven, assuming the
machine has the or not instruction, but it might be worthwhile if a
conditional branch in this context takes more than five cycles.
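

For reference, here is how the whole routine might look with that substitution made. This is a sketch, not a listing from the text: the name isqrt_nobranch is ours, unsigned is assumed to be 32 bits wide, and the right shift of a negative int is assumed to propagate the sign (true of most compilers, but not guaranteed by the C standard).

```c
/* Integer square root, hardware algorithm, with the if group replaced
   by a mask. Assumes 32-bit unsigned and a sign-propagating >> on int. */
unsigned isqrt_nobranch(unsigned x) {
    unsigned m = 0x40000000, y = 0, b;
    int t;

    while (m != 0) {                      /* Do 16 times.            */
        b = y | m;
        y = y >> 1;
        t = (int)(x | ~(x - b)) >> 31;    /* -1 if x >= b, else 0.   */
        x = x - (b & t);                  /* Subtract b, or nothing. */
        y = y | (m & t);                  /* Set bit m, or nothing.  */
        m = m >> 2;
    }
    return y;
}
```

The predicate relies on the high-order bit of b being zero; it is not a general unsigned comparison.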


Somehow it seems that it should be easier than some hundred cycles 
to compute an integer square root in software. Toward this end, we 
offer the expressions below to compute it for very small values of the 
argument. These can be useful to speed up some of the algorithms 
given above, if the argument is expected to be small.

[Table of expressions for small arguments not recoverable from the source.]

Ah, the elusive square root,
It should be a cinch to compute.
But the best we can do
Is use powers of two
And iterate the method of Newt!


11-2 Integer Cube Root 


For cube roots, Newton's method does not work out very well. The
iterative formula is a bit complex:


$$x_{n+1} = \frac{1}{3}\left(2x_n + \frac{a}{x_n^2}\right),$$


and there is of course the problem of getting a good starting value $x_0$.
However, there is a hardware algorithm, similar to the hardware
algorithm for square root, that is not too bad for software. It is shown
in Figure 11-5.


Figure 11-5 Integer cube root, hardware algorithm. 


int icbrt(unsigned x) {
   int s;
   unsigned y, b;

   s = 30;
   y = 0;
   while(s >= 0) {              // Do 11 times.
      y = 2*y;
      b = (3*y*(y + 1) + 1) << s;
      s = s - 3;
      if (x >= b) {
         x = x - b;
         y = y + 1;
      }
   }
   return y;
}


The three add's of 1 can be replaced by or's of 1, because the value 
being incremented is even. Even with this change, the algorithm is of 
questionable value for implementation in hardware, mainly because of 
the multiplication y * (y + 1). 


This multiplication is easily avoided by applying the compiler 
optimization of strength reduction to the y-squared term. Introduce 
another unsigned variable y2 that will have the value of y-squared, 
by updating y2 appropriately wherever y receives a new value. Just 
before y = 0 insert y2 = 0. Just before y = 2*y insert y2 =
4*y2. Change the assignment to b to b = (3*y2 + 3*y + 1) <<
s (and factor out the 3). Just before y = y + 1, insert y2 = y2 +
2*y + 1. The resulting program has no multiplications except by
small constants, which can be changed to shift's and add's. This 
program has three add's of 1, which can all be changed to or's of 1. It 
is faster unless your machine's multiply instruction takes only two or 
fewer cycles. 
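

Applying those steps gives something like the following. It is a sketch of the strength-reduced program as we read the recipe, not a listing from the text; the name icbrt_sr is hypothetical, and unsigned is assumed to be 32 bits wide.

```c
/* Integer cube root with the y*(y + 1) product removed by strength
   reduction: y2 always holds the value of y squared. */
unsigned icbrt_sr(unsigned x) {
    int s = 30;
    unsigned y = 0, y2 = 0, b;

    while (s >= 0) {                  /* Do 11 times.               */
        y2 = 4*y2;                    /* Keeps y2 == y*y when ...   */
        y = 2*y;                      /* ... y is doubled.          */
        b = (3*(y2 + y) + 1) << s;    /* 3*y*y + 3*y + 1, shifted.  */
        s = s - 3;
        if (x >= b) {
            x = x - b;
            y2 = y2 + 2*y + 1;        /* (y + 1)^2 = y^2 + 2y + 1.  */
            y = y + 1;
        }
    }
    return y;
}
```

The multiplications by 3 and 4 remain in the sketch for readability; a compiler will normally turn them into shifts and adds.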


Caution: [GLS1] points out that the code of Figure 11-5, and its
strength-reduced derivative, do not work if adapted in the obvious
way to a 64-bit machine. The assignment to b can then overflow. This
problem can be avoided by dropping the shift left of s from the
assignment to b, inserting after the assignment to b the assignment
bs = b << s, and changing the two lines if (x >= b) {x = x
- b ... to if (x >= bs && b == (bs >> s)) {x = x - bs ...
The test b == (bs >> s) must see the same value of s that was used to
form bs, so s has to be decremented after the test rather than before it.
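

A 64-bit adaptation along those lines might look as follows. This is our sketch, not code from [GLS1]: the name icbrt64 and the starting shift of 63 (22 iterations) are assumptions, and the decrement of s is placed after the test so that bs >> s undoes exactly the shift that formed bs.

```c
#include <stdint.h>

/* 64-bit integer cube root. The shifted comparand bs is accepted only
   if shifting it back recovers b, that is, if b << s did not overflow. */
uint64_t icbrt64(uint64_t x) {
    int s = 63;
    uint64_t y = 0, b, bs;

    while (s >= 0) {                  /* Do 22 times.                 */
        y = 2*y;
        b = 3*y*(y + 1) + 1;          /* Unshifted; cannot overflow.  */
        bs = b << s;                  /* May lose high-order bits.    */
        if (x >= bs && b == (bs >> s)) {
            x = x - bs;
            y = y + 1;
        }
        s = s - 3;                    /* After the test, not before.  */
    }
    return y;
}
```

When the shift overflows, the true comparand exceeds any 64-bit x, so skipping the subtraction is the correct action.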


11-3 Integer Exponentiation 

Computing $x^n$ by Binary Decomposition of n

A well-known technique for computing $x^n$, when n is a nonnegative
integer, involves the binary representation of n. The technique applies
to the evaluation of an expression of the form x • x • x • ... • x
where • is any associative operation, such as addition, multiplication
including matrix multiplication, and string concatenation (as
suggested by the notation $(\text{'ab'})^3 = \text{'ababab'}$). As an example, suppose
we wish to compute $y = x^{13}$. Because 13 expressed in binary is 1101
(that is, 13 = 8 + 4 + 1),


$$x^{13} = x^{8+4+1} = x^8 \cdot x^4 \cdot x.$$


Thus, $x^{13}$ may be computed as follows:

$t_1 \leftarrow x^2$
$t_2 \leftarrow t_1^2$
$t_3 \leftarrow t_2^2$
$y \leftarrow t_3 \cdot t_2 \cdot x$


This requires five multiplications, considerably fewer than the 12 that
would be required by repeated multiplication by x.


If the exponent is a variable, known to be a nonnegative integer, the 
technique can be employed in a subroutine, as shown in Figure 11-6. 


Figure 11-6 Computing $x^n$ by binary decomposition of n.


int iexp(int x, unsigned n) {
   int p, y;

   y = 1;                       // Initialize result
   p = x;                       // and p.
   while(1) {
      if (n & 1) y = p*y;       // If n is odd, mult by p.
      n = n >> 1;               // Position next bit of n.
      if (n == 0) return y;     // If no more bits in n.
      p = p*p;                  // Power for next bit of n.
   }
}

The number of multiplications done by this method is, for exponent
$n \ge 1$,


$$\lfloor \log_2 n \rfloor + \mathrm{nbits}(n) - 1.$$


This is not always the minimal number of multiplications. For
example, for n = 27 the binary decomposition method computes


$$x^2,\ x^4,\ x^8,\ x^{16},\ x^{16} \cdot x^8 \cdot x^2 \cdot x,$$


which requires seven multiplications. However, the scheme illustrated
by


$$((x^3)^3)^3$$


requires only six. The smallest number for which the binary
decomposition method is not optimal is n = 15 (hint: $x^{15} = (x^3)^5$).


Perhaps surprisingly, there is no known simple method that, for all n,
finds an optimal sequence of multiplications to compute $x^n$. The only
known methods involve an extensive search. The problem is discussed
at some length in [Knu2, sec. 4.6.3].


The binary decomposition method has a variant that scans the binary
representation of the exponent in left-to-right order [Rib, 32], which is
analogous to the left-to-right method of converting binary to decimal.
Initialize the result y to 1, and scan the exponent from left to right.
When a 0 is encountered, square y. When a 1 is encountered, square y
and multiply it by x. This computes $x^{13} = x^{1101_2}$ as


$$(((1^2 \cdot x)^2 \cdot x)^2)^2 \cdot x.$$


It always requires the same number of (nontrivial) multiplications as 
the right-to-left method of Figure 11-6. 
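

The left-to-right variant is easy to express in C. The following sketch assumes a 32-bit unsigned exponent and, like Figure 11-6, ignores overflow of the result; the name iexp_lr is ours.

```c
/* Left-to-right binary exponentiation. Leading 0-bits of n merely
   square y while it is still 1, which costs nothing but time. */
int iexp_lr(int x, unsigned n) {
    int y = 1;
    unsigned mask;

    for (mask = 0x80000000u; mask != 0; mask >>= 1) {
        y = y*y;                    /* Square for every bit of n.   */
        if (n & mask) y = y*x;      /* Multiply by x for a 1-bit.   */
    }
    return y;
}
```

A practical version would skip the leading 0-bits, for example by starting the mask at the most significant 1-bit of n.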


$2^n$ in Fortran


The IBM XL Fortran compiler takes the definition of this function to be


$$\mathrm{pow2}(n) = \begin{cases} 2^n, & 0 \le n \le 30, \\ -2^{31}, & n = 31, \\ 0, & n < 0 \text{ or } n \ge 32. \end{cases}$$


It is assumed that n and the result are interpreted as signed integers.
The ANSI/ISO Fortran standard requires that the result be 0 if n < 0.


The definition above for $n \ge 31$ seems reasonable in that it is the
correct result modulo $2^{32}$, and it agrees with what repeated
multiplication would give.


The standard way to compute $2^n$ is to put the integer 1 in a register
and shift it left n places. This does not satisfy the Fortran definition,
because shift amounts are usually treated modulo 64 or modulo 32
(on a 32-bit machine), which gives incorrect results for large or
negative shift amounts.


If your machine has number of leading zeros, pow2(n) may be
computed in four instructions as follows [Shep]:


x ← nlz(n >> 5);    // x = 32 if 0 ≤ n ≤ 31, x < 32 otherwise.
x ← x >> 5;         // x = 1 if 0 ≤ n ≤ 31, 0 otherwise.
pow2 ← x << n;


The shift right operations are "logical" (not sign-propagating), even 
though n is a signed quantity. 


If the machine does not have the "nlz" instruction, its use above can be
replaced with one of the x = 0 tests given in "Comparison Predicates"
on page 21, changing the expression x >> 5 to x >> 31. A possibly
better method is to realize that the predicate $0 \le x \le 31$ is equivalent
to the unsigned comparison $x \overset{u}{<} 32$, and then simplify the expression
for $x \overset{u}{<} y$ given in the cited section; it becomes ¬x & (x - 32).
This gives a solution in five instructions (four if the machine has and not):


x ← ¬n & (n - 32);  // x < 0 iff 0 ≤ n ≤ 31.
x ← x >> 31;        // x = 1 if 0 ≤ n ≤ 31, 0 otherwise.
pow2 ← x << n;
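

In C the five-instruction version can be sketched as below. Some care is needed that the machine-level description does not require: C leaves a shift by 32 or more undefined, so the sketch masks the shift amount to five bits, imitating a machine that treats shift amounts modulo 32. The use of int32_t and uint32_t, and the conversion of 0x80000000 to a negative value, assume a two's-complement implementation.

```c
#include <stdint.h>

/* pow2(n) under the Fortran definition, without nlz. All shifts are
   logical because they operate on unsigned values. */
int32_t pow2(int32_t n) {
    uint32_t un = (uint32_t)n;
    uint32_t x = ~un & (un - 32);     /* Sign bit on iff 0 <= n <= 31. */
    x = x >> 31;                      /* 1 if 0 <= n <= 31, else 0.    */
    return (int32_t)(x << (un & 31)); /* Shift amount taken mod 32.    */
}
```

For n = 31 the result is the bit pattern 0x80000000, which is -2^31 when interpreted as a signed integer.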


11-4 Integer Logarithm 


By the "integer logarithm" function we mean the function $\lfloor \log_b x \rfloor$,
where x is a positive integer and b is an integer greater than or equal
to 2. Usually, b = 2 or 10, and we denote these functions by "ilog2"
and "ilog10," respectively. We use "ilog" when the base is unspecified.


It is convenient to extend the definition to x = 0 by defining ilog(0) 
= -1 [CJS]. There are several reasons for this definition: 


The function ilog2(x) is then related very simply to the 
number of leading zeros function, nlz(x), by the formula shown 
below, including the case x = 0. Thus, if one of these functions 
is implemented in hardware or software, the other is easily 
obtained. 


ilog2(x) = 31 - nlz(x)


It is easy to compute $\lceil \log(x) \rceil$ using the formula below. For
x = 1, this formula implies that ilog(0) = -1.


$\lceil \log(x) \rceil = \mathrm{ilog}(x - 1) + 1$


It makes the following identity hold for x = 1 (but it doesn't
hold for x = 0):


ilog2(x ÷ 2) = ilog2(x) - 1


It preserves the mathematical identity: 


$\lfloor \log_{10} x \rfloor = \lfloor (\log_{10} 2) \log_2 x \rfloor$


It makes the result of ilog(x) a small dense set of integers (-1
to 31 for ilog2(x) on a 32-bit machine, with x unsigned),
making it directly useful for indexing a table.


It falls naturally out of several algorithms for computing 
ilog2(x) and ilog10(x). 


Unfortunately, it isn't the right definition for "number of digits of x," 
which is ilog(x) + 1 for all x except x = 0. But it seems best to 
consider that anomalous. 


For x < 0, ilog(x) is left undefined. To extend its range of utility, we 
define the function as mapping unsigned numbers to signed numbers. 
Thus, a negative argument cannot occur. 


Integer Log Base 2 


Computing ilog2(x) is essentially the same as computing the number 
of leading zeros, which is discussed in "Counting Leading 0's" on page 
77. All the algorithms in that section can be easily modified to 
compute ilog2(x) directly, rather than by computing nlz(x) and 
subtracting the result from 31. (For the algorithm of Figure 5-11 on 
page 80, change the line return pop (~x) to return pop(x) - 
1). 
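

For experimenting with the routines in this section on a machine or compiler without the instruction, a simple software stand-in for nlz suffices. The sketch below is deliberately slow and obvious; it assumes a 32-bit unsigned, and the name ceil_log2 is ours. It also exercises the ceiling formula given earlier.

```c
/* Number of leading zeros, by shifting left until the top bit is 1. */
int nlz(unsigned x) {
    int n = 0;
    if (x == 0) return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Integer log base 2; ilog2(0) = -1 falls out of nlz(0) = 32. */
int ilog2(unsigned x) {
    return 31 - nlz(x);
}

/* Ceiling of log2(x) for x >= 1, from ilog(x - 1) + 1. */
int ceil_log2(unsigned x) {
    return ilog2(x - 1) + 1;
}
```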


Integer Log Base 10 


This function has application in converting a number to decimal for 
inclusion into a line with leading zeros suppressed. The conversion 
process successively divides by 10, producing the least significant digit 
first. It would be useful to know ahead of time where the least 
significant digit should be placed, to avoid putting the converted 
number in a temporary area and then moving it. 


To compute ilog10(x), a table search is quite reasonable. This could be 
a binary search, but because the table is small and in many 
applications x is usually small, a simple linear search is probably best. 
This rather straightforward program is shown in Figure 11-7. 


Figure 11-7 Integer log base 10, simple table search. 


int ilog10(unsigned x) {
   int i;
   static unsigned table[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   for (i = -1; ; i++) {
      if (x <= table[i+1]) return i;
   }
}


On the basic RISC, this program can be implemented to execute in
about $9 + 4\lfloor \log_{10} x \rfloor$ instructions. Thus, it executes in five to 45
instructions, with perhaps 13 (for $10 \le x \le 99$) being typical.


The program in Figure 11-7 can easily be changed into an "in register" 
version (not using a table). The executable part of such a program is 
shown in Figure 11-8. This might be useful if the machine has a fast 
way to multiply by 10. 


Figure 11-8 Integer log base 10, repeated multiplication by 10. 


   p = 1;
   for (i = -1; i <= 8; i++) {
      if (x < p) return i;
      p = 10*p;
   }
   return i;


This program can be implemented to execute in about
$10 + 6\lfloor \log_{10} x \rfloor$ instructions on the basic RISC (counting the
multiply as one instruction). This amounts to 16 instructions for
$10 \le x \le 99$.


A binary search can be used, giving an algorithm that is loop-free and
does not use a table. Such an algorithm might compare x to $10^4$, then
to either $10^2$ or to $10^6$, and so on, until the exponent n is found such
that $10^n \le x < 10^{n+1}$. The paths execute in ten to 18 instructions,
four or five of which are branches (counting the final unconditional
branch).
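

One way such a search might be laid out is sketched below. The arrangement of the comparisons and the name ilog10_bs are our own choices, so the instruction counts quoted above should not be expected to match this sketch exactly.

```c
/* Integer log base 10 by a loop-free, table-free binary search. */
int ilog10_bs(unsigned x) {
    if (x < 10000u) {
        if (x < 100u) {
            if (x < 10u) return (x > 0) - 1;   /* 0, or -1 for x = 0. */
            return 1;
        }
        return (x < 1000u) ? 2 : 3;
    }
    if (x < 1000000u)
        return (x < 100000u) ? 4 : 5;
    if (x < 100000000u)
        return (x < 10000000u) ? 6 : 7;
    return (x < 1000000000u) ? 8 : 9;
}
```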


The program shown in Figure 11-9 is a modification of the binary
search that has a maximum of four branches on any path, and is
written in a way that favors small x. It executes in six basic RISC
instructions for $10 \le x \le 99$, and in 11 to 16 instructions for $x \ge 100$.


Figure 11-9 Integer log base 10, modified binary search. 


int ilog10(unsigned x) {
   if (x > 99)
      if (x < 1000000)
         if (x < 10000)
            return 3 + ((int)(x - 1000) >> 31);
         else
            return 5 + ((int)(x - 100000) >> 31);
      else
         if (x < 100000000)
            return 7 + ((int)(x - 10000000) >> 31);
         else
            return 9 + ((int)((x-1000000000)&~x) >> 31);
   else
      if (x > 9) return 1;
      else return ((int)(x - 1) >> 31);
}


The shift instructions in this program are signed shifts (which is the
reason for the (int) casts). If your machine does not have this
instruction, one of the alternatives below, which use unsigned shifts,
may be preferable. These are illustrated for the case of the first
return statement. Unfortunately, the first two require subtract from
immediate for efficient implementation, which most machines don't
have. The last involves adding a large constant (two instructions), but
this does not matter for the second and third return statements,
which require adding a large constant anyway. The large constant is
$2^{31} - 1000$.


return 3 - ((x - 1000) >> 31); 

return 2 + ((999 - x) >> 31); 

return 2 + ((x + 2147482648) >> 31); 

An alternative for the fourth return statement is

return 8 + (((x + 1147483648) | x) >> 31);


where the large constant is $2^{31} - 10^9$. This avoids both the and not and
the signed shift.


Alternatives for the last if-else construction are 

return ((int) (x - 1) >> 31) | ((unsigned) (9 - x) >> 31); 
return (x > 9) + (x > 0) - 1; 

either of which saves a branch. 


If nlz(x) or ilog2(x) is available as an instruction, there are better and 
more interesting ways to compute ilog10(x). For example, the 
program in Figure 11-10 does it in two table lookups [CJS]. 


Figure 11-10 Integer log base 10 from log base 2, double table lookup. 


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {9, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[10] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000};

   y = table1[nlz(x)];
   if (x < table2[y]) y = y - 1;
   return y;
}


From table1 an approximation to ilog10(x) is obtained. The
approximation is usually the correct value, but it is too high by 1 for x 
= 0 and for x in the range 8 to 9, 64 to 99, 512 to 999, 8192 to 9999, 
and so on. The second table gives the value below which the estimate 
must be corrected by subtracting 1. 


This scheme uses a total of 73 bytes for tables, and can be coded in 
only six instructions on the IBM System/370 [CJS] (to achieve this, 
the values in table1 must be four times the values shown). It 
executes in about ten instructions on a RISC that has number of 
leading zeros but no other esoteric instructions. The other methods to 
be discussed are variants of this. 


The first variation eliminates the conditional branch that results from 
the if statement. Actually, the program in Figure 11-10 can be coded 
free of branches if the machine has the set less than unsigned 
instruction, but the method to be described can be used on machines 
that have no unusual instructions (other than number of leading 
zeros). 


The method is to replace the if statement with a subtraction followed
by a shift right of 31 so that the sign bit can be subtracted from y. A
difficulty occurs for large x ($x \ge 2^{31} + 10^9$), which can be fixed by
adding an entry to table2, as shown in Figure 11-11.


Figure 11-11 Integer log base 10 from log base 2, double table lookup, 
branch-free. 


int ilog10(unsigned x) {
   int y;
   static unsigned char table1[33] = {10, 9, 9, 8, 8, 8,
      7, 7, 7, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3, 3,
      2, 2, 2, 1, 1, 1, 0, 0, 0, 0};
   static unsigned table2[11] = {1, 10, 100, 1000, 10000,
      100000, 1000000, 10000000, 100000000, 1000000000,
      0};

   y = table1[nlz(x)];
   y = y - ((x - table2[y]) >> 31);
   return y;
}


This executes in about 11 instructions on a RISC that has number of
leading zeros but is otherwise quite "basic." It can be modified to
return the value 0, rather than -1, for x = 0 (which is preferable for
the decimal conversion problem) by changing the last entry in
table1 to 1 (that is, by changing "0, 0, 0, 0" to "0, 0, 0, 1").


The next variation replaces the first table lookup with a subtraction, a
multiplication, and a shift. This seems likely to be possible because
$\log_{10} x$ and $\log_2 x$ are related by a multiplicative constant, namely
$\log_{10} 2 = 0.30103\ldots$. Thus, it may be possible to compute ilog10(x) by
computing $\lfloor c \cdot \mathrm{ilog2}(x) \rfloor$ for some suitable $c \approx 0.30103$, and
correcting the result by using a table such as table2 in Figure 11-11.


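To pursue this, let $\log_{10} 2 = c + \varepsilon$, where c > 0 is a rational
approximation to $\log_{10} 2$ that is a convenient multiplier, and $\varepsilon > 0$.
Then for $x \ge 1$,


$$\begin{aligned}
\mathrm{ilog10}(x) = \lfloor \log_{10} x \rfloor &= \lfloor (c + \varepsilon) \log_2 x \rfloor, \\
\lfloor c \log_2 x \rfloor \le \mathrm{ilog10}(x) &= \lfloor c \log_2 x + \varepsilon \log_2 x \rfloor, \\
\lfloor c\,\mathrm{ilog2}(x) \rfloor \le \mathrm{ilog10}(x) &\le \lfloor c\,(\mathrm{ilog2}(x) + 1) + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) + c + \varepsilon \log_2 x \rfloor \\
&\le \lfloor c\,\mathrm{ilog2}(x) \rfloor + \lfloor c + \varepsilon \log_2 x \rfloor + 1.
\end{aligned}$$


Thus, if we choose c so that $c + \varepsilon \log_2 x < 1$, then $\lfloor c\,\mathrm{ilog2}(x) \rfloor$
approximates ilog10(x) with an error of 0 or +1. Furthermore, if we
take ilog2(0) = ilog10(0) = -1, then $\lfloor c\,\mathrm{ilog2}(0) \rfloor = \mathrm{ilog10}(0)$
(because $0 < c \le 1$), so we need not be concerned about this case.
(There are other definitions that would work here, such as ilog2(0) =
ilog10(0) = 0).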


Because $\varepsilon = \log_{10} 2 - c$, we must choose c so that
$c + (\log_{10} 2 - c) \log_2 x < 1$, or


$$c\,(\log_2 x - 1) > (\log_{10} 2) \log_2 x - 1.$$


This is satisfied for x = 1 (because c < 1) and 2. For larger x, we
must have


$$c > \frac{(\log_{10} 2) \log_2 x - 1}{\log_2 x - 1}.$$


The most stringent requirement on c occurs when x is large. For a 32-bit
machine, $x < 2^{32}$, so choosing


$$c \ge \frac{0.30103 \cdot 32 - 1}{32 - 1} \approx 0.27848$$


suffices. Because c < 0.30103 (because $\varepsilon > 0$), c = 9/32 = 0.28125
is a convenient value. Experimentation reveals that coarser values
such as 5/16 and 1/4 are not adequate.
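

The experiment is easy to repeat. The sketch below is our own test harness, not part of the text: for each possible value k of ilog2(x) it checks the estimate floor(c*k) against the smallest and largest x having that ilog2, which is sufficient because ilog10 is nondecreasing. The names ilog10_ref and multiplier_ok are hypothetical, and a 32-bit unsigned is assumed.

```c
/* Reference integer log base 10 by repeated division; -1 for x = 0. */
static int ilog10_ref(unsigned x) {
    int r = -1;
    while (x != 0) {
        x = x / 10;
        r++;
    }
    return r;
}

/* Returns 1 if c = num/2^shift always estimates ilog10(x) exactly or
   low by 1, for every 32-bit x >= 1; otherwise returns 0. */
int multiplier_ok(int num, int shift) {
    int k;
    for (k = 0; k <= 31; k++) {
        int est = (num*k) >> shift;
        unsigned lo = 1u << k;             /* Smallest x, ilog2 = k. */
        unsigned hi = lo + (lo - 1);       /* Largest x, ilog2 = k.  */
        if (est > ilog10_ref(lo)) return 0;       /* Too high.       */
        if (est + 1 < ilog10_ref(hi)) return 0;   /* Too low by 2.   */
    }
    return 1;
}
```

With this harness, multiplier_ok(9, 5) succeeds, while multiplier_ok(5, 4) fails because 5/16 exceeds $\log_{10} 2$ and overestimates, and multiplier_ok(1, 2) fails because 1/4 falls too far below it for large x.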


This leads to the scheme illustrated in Figure 11-12, which estimates 
low and then corrects by adding 1. It executes in about 11 instructions 
on a RISC that has number of leading zeros, counting the multiply as 
one instruction. 


Figure 11-12 Integer log base 10 from log base 2, one table lookup. 


int ilog10(unsigned x) {
   int y;
   static unsigned table2[10] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999};

   y = (9*(31 - nlz(x))) >> 5;
   if (x > table2[y+1]) y = y + 1;
   return y;
}


This can be made into a branch-free version, but again there is a
difficulty with large x ($x \ge 2^{31} + 10^9$), which can be fixed in either of
two ways. One way is to use a different multiplier (19/64) and a
slightly expanded table. The program is shown in Figure 11-13 (about
11 instructions on a RISC that has number of leading zeros, counting
the multiply as one instruction).


Figure 11-13 Integer log base 10 from log base 2, one table lookup, branch- 
free. 


int ilog10(unsigned x) {
   int y;
   static unsigned table2[11] = {0, 9, 99, 999, 9999,
      99999, 999999, 9999999, 99999999, 999999999,
      0xFFFFFFFF};

   y = (19*(31 - nlz(x))) >> 6;
   y = y + ((table2[y+1] - x) >> 31);
   return y;
}


The other "fix" is to or x into the result of the subtraction, to force the
sign bit to be on for $x \ge 2^{31}$; that is, change the second executable
line of Figure 11-12 to

y = y + (((table2[y+1] - x) | x) >> 31);


This is the preferable program if multiplication by 19 is substantially 
more difficult than multiplication by 9 (as it is for a shift-and-add 
sequence). 


For a 64-bit machine, choosing


$$c \ge \frac{0.30103 \cdot 64 - 1}{64 - 1} \approx 0.28993$$


suffices. The value 19/64 = 0.296875 is convenient, and
experimentation reveals that no coarser value is adequate. The
program is (branch-free version)


unsigned table2[20] = {0, 9, 99, 999, 9999, ...,
   9999999999999999999};

y = (19*(63 - nlz(x))) >> 6;
y = y + ((table2[y+1] - x) >> 63);
return y;


Chapter 12. Unusual Bases for Number Systems 


This section discusses a few unusual positional number systems. They 
are just interesting curiosities and are probably not practical for 
anything. We limit the discussion to integers, but they can all be 
extended to include digits after the radix point—which usually, but 
not always, denotes non-integers. 


12-1 Base -2 


By using -2 as the base, both positive and negative integers can be
expressed without an explicit sign or other irregularity such as having
a negative weight for the most significant bit [Knu3]. The digits used
are 0 and 1, as in base +2; that is, the value represented by a string of
1's and 0's is understood to be


$$(a_n \ldots a_3 a_2 a_1 a_0) = a_n(-2)^n + \cdots + a_3(-2)^3 + a_2(-2)^2 + a_1(-2) + a_0.$$


From this, it can be seen that a procedure for finding the base -2, or 
"negabinary," representation of an integer is to successively divide the 
number by -2, recording the remainders. The division must be such 
that it always gives a remainder of 0 or 1 (the digits to be used); that 
is, it must be modulus division. As an example, the plan below shows 
how to find the base -2 representation of -3.


$$\begin{aligned}
\frac{-3}{-2} &= 2 \text{ rem } 1 \\
\frac{2}{-2} &= -1 \text{ rem } 0 \\
\frac{-1}{-2} &= 1 \text{ rem } 1 \\
\frac{1}{-2} &= 0 \text{ rem } 1
\end{aligned}$$


Because we have reached a 0 quotient, the process terminates (if 
continued, the remaining quotients and remainders would all be 0). 
Thus, reading the remainders upwards, we see that -3 is written 1101 
in base -2. 
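

The procedure can be sketched in C as follows. Because C's / and % truncate toward zero, a remainder of -1 can appear; the sketch adjusts it to 1 and bumps the quotient, which is what modulus division would have produced. The name to_negabinary is ours, and the argument is assumed to lie in the range representable in 32 bits of base -2 (in particular, positive arguments must not exceed 1431655765).

```c
/* Returns the base -2 representation of n as a bit string: bit i of
   the result is the digit weighted by (-2)^i. */
unsigned to_negabinary(int n) {
    unsigned bits = 0;
    int i = 0;

    while (n != 0) {
        int r = n % -2;              /* 0, 1, or -1 (C truncates).   */
        n = n / -2;
        if (r < 0) {                 /* Convert to modulus division: */
            r = r + 2;               /* remainder becomes 1,         */
            n = n + 1;               /* quotient increases by 1.     */
        }
        bits |= (unsigned)r << i;    /* Record the remainder.        */
        i++;
    }
    return bits;
}
```

For example, to_negabinary(-3) yields the bit string 1101, in agreement with the plan above.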


Table 12-1 shows, on the left, how each bit pattern from 0000 to 1111 
is interpreted in base -2, and on the right, how integers in the range 
-15 to +15 are represented. 


Table 12-1. Conversions between Decimal and Base -2 


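[Table body not recoverable from the source. Left half: each bit pattern from 0000 to 1111 with its value in base -2. Right half: the base -2 representation of each integer from -15 to +15.]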


It is not obvious that the $2^n$ possible bit patterns in an n-bit word
uniquely represent all integers in a certain range, but this can be
shown by induction. The inductive hypothesis is that an n-bit word
represents all integers in the range


Equation 1a


$$-(2^{n+1} - 2)/3 \text{ to } (2^n - 1)/3 \text{ for } n \text{ even, and}$$


Equation 1b


$$-(2^n - 2)/3 \text{ to } (2^{n+1} - 1)/3 \text{ for } n \text{ odd.}$$


Assume first that n is even. For n = 2, the representable integers are
10, 11, 00, and 01 in base -2, or


-2, -1, 0, 1.


This agrees with (1a), and each integer in the range is represented 
once and only once. 


A word of n + 1 bits can, with a leading bit of 0, represent all the
integers given by (1a). In addition, with a leading bit of 1, it can
represent all these integers biased by $(-2)^n = 2^n$. The new range is


$$2^n - (2^{n+1} - 2)/3 \text{ to } 2^n + (2^n - 1)/3,$$


or


$$(2^n - 1)/3 + 1 \text{ to } (2^{n+2} - 1)/3.$$


This is contiguous to the range given by (1a), so for a word size of n
+ 1 bits, all integers in the range


$$-(2^{n+1} - 2)/3 \text{ to } (2^{n+2} - 1)/3$$


are represented once and only once. This agrees with (1b), with n
replaced by n + 1.


The proof that (1a) follows from (1b), for n odd, and that all integers 
in the range are uniquely represented, is similar. 


To add and subtract, the usual rules, such as 0 + 1 = 1 and 1 - 1 =
0, of course apply. Because 2 is written 110, and -1 is written 11, and
so on, the following additional rules apply. These, together with the
obvious ones, suffice.


1 + 1 = 110
11 + 1 = 0
1 + 1 + 1 = 111
0 - 1 = 11
11 - 1 = 10

When adding or subtracting, there are sometimes two carry bits. The
carry bits are to be added to their column, even when subtracting. It is
convenient to place them both over the next bit to the left, and
simplify (when possible) using 11 + 1 = 0. If 11 is carried to a
column that contains two 0's, bring down a 1 and carry a 1. Below are
examples.


Addition

   11  1 11     11
       1  0  1  1  1       19
+   1  1  0  1  0  1    +(-11)
    0  1  1  0  0  0        8

Subtraction

    1 11  1     1
          1  0  1  0  1       21
-      1  0  1  1  1  0    -(-38)
    1  0  0  1  1  1  1       59


The only carries possible are 0, 1, and 11. Overflow occurs if there is a 
carry (either 1 or 11) out of the high-order position. These remarks 
apply to both addition and subtraction. 
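

The addition rules can be mechanized by keeping the carry as a small signed integer, with the carry written "11" above represented by -1 (its value in base -2). The following sketch is ours; the name add_negabinary is hypothetical, the operands are 32-bit strings of base -2 digits, and overflow is simply ignored.

```c
/* Adds two base -2 bit strings column by column. In each column the
   sum t of the two digits and the incoming carry satisfies
   t = digit - 2*carry, because the next column has weight -2 times
   this one. The carry is always 0, 1, or -1. */
unsigned add_negabinary(unsigned a, unsigned b) {
    unsigned sum = 0;
    int carry = 0;
    int i;

    for (i = 0; i < 32; i++) {
        int t = (int)((a >> i) & 1) + (int)((b >> i) & 1) + carry;
        int digit = (t + 2) % 2;        /* t is in -1..3; digit 0 or 1. */
        sum |= (unsigned)digit << i;
        carry = -((t - digit) / 2);     /* New carry: 0, 1, or -1.      */
    }
    return sum;                         /* Nonzero carry here: overflow. */
}
```

Adding the bit strings for 19 and -11 (10111 and 110101) with this routine gives 11000, which is 8, as in the example above.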


Because there are three possibilities for the carry, a base -2 adder 
would be more complex than a two's-complement adder. 


There are two ways to negate an integer. It may be added to itself
shifted left one position (that is, multiply by -1), or it may be
subtracted from 0. There is no rule as simple and convenient as the
"complement and add 1" rule of two's-complement arithmetic. In
two's-complement, this rule is used to build a subtracter from an adder
(to compute A - B, form $A + \bar{B} + 1$). There does not seem to be any
such simple device for base -2.


Multiplication of base -2 integers is straightforward. Just use the rule 
that 1 x 1 = 1 and 0 times either 0 or 1 is 0, and add the columns 
using base -2 addition. 


Division, however, is quite complicated. It is a real challenge to devise 
a reasonable hardware division algorithm—that is, one based on 
repeated subtraction and shifting. Figure 12-1 shows an algorithm that 
is expressed, for definiteness, for an 8-bit machine. It does modulus 
division (nonnegative remainder). 


Figure 12-1 Division in base -2. 


int divbm2(int n, int d) {      // q = n/d in base -2.
   int r, dw, c, q, i;

   r = n;                       // Init. remainder.
   dw = (-128)*d;               // Position d.
   c = (-43)*d;                 // Init. comparand.
   if (d > 0) c = c + d;
   q = 0;                       // Init. quotient.
   for (i = 7; i >= 0; i--) {
      if (d > 0 ^ (i&1) == 0 ^ r >= c) {
         q = q | (1 << i);      // Set a quotient bit.
         r = r - dw;            // Subtract d shifted.
      }
      dw = dw/(-2);             // Position d.
      if (d > 0) c = c - 2*d;   // Set comparand for
      else c = c + d;           //   next iteration.
      c = c/(-2);
   }
   return q;                    // Return quotient in
                                // base -2.
                                // Remainder is r,
}                               // 0 <= r < |d|.


Although this program is written in C and was tested on a binary
two's-complement machine, that is immaterial; it should be viewed
somewhat abstractly. The input quantities n and d, and all internal
variables except for q, are simply numbers without any particular
representation. The output q is a string of bits to be interpreted in
base -2.


This requires a little explanation. If the input quantities were in base
-2, the algorithm would be very awkward to express in an executable
form. For example, the test "if (d > 0)" would have to test that the
most significant bit of d is in an even position. The addition in "c =
c + d" would have to be a base -2 addition. The code would be very
hard to read. The way the algorithm is coded, you should think of n
and d as numbers without any particular representation. The code
shows the arithmetic operations to be performed, whatever encoding
is used. If the numbers are encoded in base -2, as they would be in
hardware that implements this algorithm, the multiplication by -128 is
a left shift of seven positions, and the divisions by -2 are right shifts of
one position.


As examples, the code computes values as follows:

divbm2(6, 2) = 7 (six divided by two is 111 in base -2)
divbm2(-4, 3) = 2 (minus four divided by three is 10 in base -2)
divbm2(-4, -3) = 6 (minus four divided by minus three is 110 in base -2)


The step q = q | (1 << i); represents simply setting bit i of q. 
The next line— r = r - dw—represents reducing the remainder by 
the divisor d shifted left. 


The algorithm is difficult to describe in detail, but we will try to give 
the general idea. 


Consider determining the value of the first bit of the quotient, bit 7 of
q. In base -2, 8-bit numbers that have their most significant bit "on"
range in value from -170 to -43. Therefore, ignoring the possibility of
overflow, the first (most significant) quotient bit will be 1 if (and only
if) the quotient will be algebraically less than or equal to -43.


Because n = qd + r and for a positive divisor $r \le d - 1$, for a positive
divisor the first quotient bit will be 1 iff $n \le -43d + (d - 1)$, or
$n < -43d + d$. For a negative divisor, the first quotient bit will be 1 iff
$n \ge -43d$ ($r \ge 0$ for modulus division).


Thus, the first quotient bit is 1 iff


$$(d > 0 \;\&\; \neg(n \ge -43d + d)) \mid (d < 0 \;\&\; n \ge -43d).$$


Ignoring the possibility that d = 0 this can be written as


$$d > 0 \oplus n \ge c,$$


where c = -43d + d if d > 0, and c = -43d if d < 0.


This is the logic for determining a quotient bit for an odd-numbered 
bit position. For an even-numbered position, the logic is reversed. 
Hence the test includes the term (i&1) == 0. (The ^ character in 
the program denotes exclusive or.) 


At each iteration, c is set equal to the smallest (closest to zero)
integer that must have a 1-bit at position i after dividing by d. If the
current remainder r exceeds that, then bit i of q is set to 1 and r is
adjusted by subtracting the value of a 1 at that position, multiplied by
the divisor d. No real multiplication is required here; d is simply
positioned properly and subtracted.


The algorithm is not elegant. It is awkward to implement because 
there are several additions, subtractions, and comparisons, and there 
is even a multiplication (by a constant) that must be done at the 
beginning. One might hope for a "uniform" algorithm—one that does 
not test the signs of the arguments and do different things depending 
on the outcome. Such a uniform algorithm, however, probably does 
not exist for base -2 (or for two's-complement arithmetic). The reason 
for this is that division is inherently a non-uniform process. Consider 
the simplest algorithm of the shift-and-subtract type. This algorithm 
would not shift at all, but for positive arguments would simply 
subtract the divisor from the dividend repeatedly, counting the 
number of subtractions performed until the remainder is less than the 
divisor. But if the dividend is negative (and the divisor is positive), the 
process is to add the divisor repeatedly until the remainder is 0 or 
positive, and the quotient is the negative of the count obtained. The 
process is still different if the divisor is negative. 


In spite of this, division is a uniform process for the signed-magnitude 
representation of numbers. With such a representation, the 
magnitudes are positive, so the algorithm can simply subtract 
magnitudes and count until the remainder is negative, and then set 
the sign bit of the quotient to the exclusive or of the arguments, and 
the sign bit of the remainder equal to the sign of the dividend (this 
gives ordinary truncating division). 


The algorithm given above could be made more uniform, in a sense,
by first complementing the divisor, if it is negative, and then
performing the steps given as simplified by having d > 0. Then a
correction would be performed at the end. For modulus division, the
correction is to negate the quotient and leave the remainder
unchanged. This moves some of the tests out of the loop, but the
algorithm as a whole is still not pretty.


It is interesting to contrast the commonly used number representations 
and base -2 regarding the question of whether or not the computer 
hardware treats numbers uniformly in carrying out the four 
fundamental arithmetic operations. We don't have a precise definition 
of "uniformly," but basically it means free of operations that might or 
might not be done, depending on the signs of the arguments. We 
consider setting the sign bit of the result equal to the exclusive or of 
the signs of the arguments to be a uniform operation. Table 12-2 
shows which operations treat their operands uniformly with various 
number representations. 


One's-complement addition and subtraction are done uniformly by 
means of the "end around carry" trick. For addition, all bits, including 
the sign bit, are added in the usual binary way, and the carry out of 
the leftmost bit (the sign bit) is added to the least significant position. 
This process always terminates right away (that is, the addition of the 
carry cannot generate another carry out of the sign bit position). 
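
The end-around carry can be sketched in a few lines of C. This is only an illustration: the 16-bit word size and the function names are assumptions made here, not something taken from the text.

```c
#include <stdint.h>

/* Adds two 16-bit one's-complement values using end-around carry.
   The 16-bit width is an assumption chosen for illustration. */
uint16_t ones_complement_add(uint16_t a, uint16_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b;  /* ordinary binary add */
    sum = (sum & 0xFFFFu) + (sum >> 16);       /* add carry-out to bit 0 */
    return (uint16_t)sum;                      /* no second carry occurs */
}

/* Negation in one's complement is the bitwise complement. */
uint16_t ones_complement_neg(uint16_t a) {
    return (uint16_t)~a;
}
```

For example, adding 5 and the one's-complement encoding of -3 (0xFFFC) gives 0x10001 before the carry is folded back in, and 2 afterwards. No test of the operand signs is needed.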


Table 12-2. Uniform Operations in Various Number Encodings 

                  Signed-      One's-       Two's-
                  magnitude    complement   complement   Base -2
addition          no           yes          yes          yes
subtraction       no           yes          yes          yes
multiplication    yes          no           no           yes
division          yes          no           no           no


In the case of two's-complement multiplication, the entry is "yes" if 
only the right half of the doubleword product is desired. 


We conclude this discussion of the base -2 number system with some 
observations about how to convert between straight binary and base 
-2. 


To convert to binary from base -2, form a word that has only the bits 
with positive weight, and subtract a word that has only the bits with 
negative weight, using the subtraction rules of binary arithmetic. An 
alternative method that may be a little simpler is to extract the bits 
appearing in the negative weight positions, shift them one position to 
the left, and subtract the extracted number from the original number 
using the subtraction rules of ordinary binary arithmetic. 


To convert to base -2 from binary, extract the bits appearing in the 
odd positions (positions weighted by 2^n with n odd), shift them one 
position to the left, and add the two numbers using the addition rules 
of base -2. Here are two examples: 


Binary from base -2                   Base -2 from binary 
      110111    (-13)                       110111    (55) 
   - 1 0 1      (binary subtract)        + 1 0 1      (base -2 add) 
  ...1110011    (-13)                      1001011    (55) 


On a computer, with its fixed word size, these conversions work for 
negative numbers if the carries out of the high-order position are 
simply discarded. To illustrate, the example on the right above can be 
regarded as converting -9 to base -2 from binary if the word size is six 
bits. 


The above algorithm for converting to base -2 cannot easily be 
implemented in software on a binary computer, because it requires 
doing addition in base -2. Schroeppel [HAK, item 128] overcomes this 
with a much more clever and useful way to do the conversions in both 
directions. To convert to binary, his method is 


B ← (N ⊕ 0b10...1010) - 0b10...1010. 


To see why this works, let the base -2 number consist of the four digits 
abcd. Then, interpreted (erroneously) in straight binary, this is 8a + 
4b + 2c + d. After the exclusive or, interpreted in binary it is 8(1 - a) 
+ 4b + 2(1 - c) + d. After the (binary) subtraction of 8 + 2, it is - 8a 
+ 4b - 2c + d, which is its value interpreted in base -2. 


Schroeppel's formula can be readily solved for N in terms of B, so it 
gives a three-instruction method for converting in the other direction. 
Collecting these results, we have the following formulas for converting 
to binary, for a 32-bit machine: 


B ← (N & 0x55555555) - (N & ¬0x55555555), 
B ← N - ((N & 0xAAAAAAAA) << 1), 
B ← (N ⊕ 0xAAAAAAAA) - 0xAAAAAAAA, 


and the following, for converting to base -2 from binary: 


N ← (B + 0xAAAAAAAA) ⊕ 0xAAAAAAAA. 
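
As a rough sketch, the 32-bit conversion formulas might be written in C as follows. The function and macro names are illustrative assumptions. The binary word is held in an unsigned type so that the carries out of the high-order position are simply discarded.

```c
#include <stdint.h>

#define NEG_MASK 0xAAAAAAAAu  /* bits in the negative-weight positions */

/* Base -2 word N -> two's-complement binary word B (Schroeppel). */
uint32_t binary_from_negabinary(uint32_t n) {
    return (n ^ NEG_MASK) - NEG_MASK;
}

/* Two's-complement binary word B -> base -2 word N (Schroeppel). */
uint32_t negabinary_from_binary(uint32_t b) {
    return (b + NEG_MASK) ^ NEG_MASK;
}

/* The "extract, shift left, subtract" variant of the first conversion. */
uint32_t binary_from_negabinary_shift(uint32_t n) {
    return n - ((n & NEG_MASK) << 1);
}
```

With these definitions, the two worked examples above correspond to negabinary_from_binary(55), which yields 1001011 (0x4B), and binary_from_negabinary(0x37), which yields the two's-complement encoding of -13.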


12-2 Base -1 + i 


By using -1 + i as the base, where i is √-1, all complex integers 
(complex numbers with integral real and imaginary parts) can be 
expressed as a single "number" without an explicit sign or other 
irregularity. Surprisingly, this can be done using only 0 and 1 for 
digits, and all integers are represented uniquely. We will not prove 
this or much else about this number system, but will just describe it 
very briefly. 


It is not entirely trivial to discover how to write the integer 2. [1] 
However, this can be determined algorithmically by successively 
dividing 2 by the base and recording the remainders. What does a 
"remainder" mean in this context? We want the remainder after 
dividing by -1 + i to be 0 or 1, if possible (so that the digits will be 0 
or 1). To see that it is always possible, assume that we are to divide an 
arbitrary complex integer a + bi by - 1 + i. Then, we wish to find q 
and r such that q is a complex integer, r = 0 or 1, and 


[1] The interested reader might warm up to this challenge. 


$$a + bi = (q_r + q_i i)(-1 + i) + r,$$


where q_r and q_i denote the real and imaginary parts of q, respectively. 
Equating real and imaginary parts and solving the two simultaneous 
equations for q gives 


$$q_r = \frac{b - a + r}{2}, \quad\text{and}\quad q_i = \frac{-a - b + r}{2}.$$


Clearly, if a and b are both even or are both odd, then by choosing r 
= 0, q is a complex integer. Furthermore, if one of a and b is even 
and the other is odd, then by choosing r = 1, q is a complex integer. 


Thus, the integer 2 can be converted to base - 1 + i by the plan 
illustrated below. 


Because the real and imaginary parts of the integer 2 are both even, 
we simply do the division, knowing that the remainder will be 0: 


$$\frac{2}{-1+i} = \frac{2(-1-i)}{(-1+i)(-1-i)} = -1 - i \ \text{rem}\ 0.$$


Because the real and imaginary parts of - 1 - i are both odd, again we 
simply divide, knowing that the remainder is 0: 


$$\frac{-1-i}{-1+i} = \frac{(-1-i)(-1-i)}{(-1+i)(-1-i)} = i \ \text{rem}\ 0.$$


Because the real and imaginary parts of i are even and odd, 
respectively, the remainder will be 1. It is simplest to account for this 
at the beginning by subtracting 1 from the dividend. 


$$\frac{i - 1}{-1 + i} = 1 \quad \text{(remainder is 1)}.$$


Because the real and imaginary parts of 1 are odd and even, the next 
remainder will be 1. Subtracting this from the dividend gives 


$$\frac{1 - 1}{-1 + i} = 0 \quad \text{(remainder is 1)}.$$


Because we have reached a 0 quotient, the process terminates, and the 
base -1 + i representation for 2 is seen to be 1100 (reading the 
remainders upwards). 
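
The repeated-division plan can be sketched in C as below. The function name and the packing of digits into a 64-bit word are assumptions made for this illustration. The code relies on the formulas for q_r and q_i given earlier, with the remainder subtracted from the dividend first so that both divisions by 2 are exact.

```c
#include <stdint.h>

/* Converts the complex integer a + bi to base -1 + i by repeated
   division.  Bit k of the result is the digit of weight (-1 + i)^k.
   Hypothetical helper; assumes the result fits in 64 digits. */
uint64_t to_base_m1_plus_i(int a, int b) {
    uint64_t digits = 0;
    int k = 0;
    while ((a != 0 || b != 0) && k < 64) {
        int r = (a ^ b) & 1;    /* remainder is 1 iff parities differ */
        a -= r;                 /* account for the remainder first */
        int qr = (b - a) / 2;   /* real part of the quotient */
        int qi = (-a - b) / 2;  /* imaginary part of the quotient */
        digits |= (uint64_t)r << k;
        a = qr;
        b = qi;
        k++;
    }
    return digits;
}
```

Called with (2, 0) this walks through the quotients -1 - i, i, 1, and 0 in turn, as in the worked example, and returns binary 1100.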


Table 12-3 shows how each bit pattern from 0000 to 1111 is 
interpreted in base - 1 + i, and how the real integers in the range -15 
to +15 are represented. 


The addition rules for base - 1 + i (in addition to the trivial ones 
involving a 0 bit) are as follows: 


1 + 1                         = 1100 
1 + 1 + 1                     = 1101 
1 + 1 + 1 + 1                 = 111010000 
1 + 1 + 1 + 1 + 1             = 111010001 
1 + 1 + 1 + 1 + 1 + 1         = 111011100 
1 + 1 + 1 + 1 + 1 + 1 + 1     = 111011101 
1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 111000000 


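Table 12-3. Conversions between Decimal and Base -1 + i 

n (base -1 + i)   n (decimal)     n (decimal)   n (base -1 + i)   -n (base -1 + i)
0000              0               0             0                 0
0001              1               1             1                 11101
0010              -1 + i          2             1100              11100
0011              i               3             1101              10001
0100              -2i             4             111010000         10000
0101              1 - 2i          5             111010001         11001101
0110              -1 - i          6             111011100         11001100
0111              -i              7             111011101         11000001
1000              2 + 2i          8             111000000         11000000
1001              3 + 2i          9             111000001         11011101
1010              1 + 3i          10            111001100         11011100
1011              2 + 3i          11            111001101         11010001
1100              2               12            100010000         11010000
1101              3               13            100010001         1110100001101
1110              1 + i           14            100011100         1110100001100
1111              2 + i           15            100011101         1110100000001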


When adding two numbers, the largest number of carries that occurs 
in one column is six, so the largest sum of a column is 8 (111000000). 
This makes for a rather complicated adder. If one were to build a 
complex arithmetic machine, it would no doubt be best to keep the 
real and imaginary parts separate, [2] with each represented in some 
sensible way such as two's-complement. 


[2] This is the way it was done at Bell Labs back in 1940 on George 
Stibitz's Complex Number Calculator [Irvine]. 


12-3 Other Bases 


The base -1 - i has essentially the same properties as the base -1 + i 
discussed above. If a certain bit pattern represents the number a + bi 
in one of these bases, then the same bit pattern represents the number 
a - bi in the other base. 


The bases 1 + i and 1 - i can also represent all the complex integers, 
using only 0 and 1 for digits. These two bases have the same 
complex-conjugate relationship to each other, as do the bases -1 ± i. 
In bases 1 ± i, the representation of some integers has an infinite string of 1's on 
the left, similar to the two's-complement representation of negative 
integers. This arises naturally by using uniform rules for addition and 
subtraction, as in the case of two's-complement. One such integer is 2, 
which (in either base) is written ...11101100. Thus, these bases have 
the rather complex addition rule 1 + 1 = ...11101100. 


By grouping into pairs the bits in the base -2 representation of an 
integer, one obtains a base 4 representation for the positive and 
negative numbers, using the digits -2, -1, 0, and 1. For example, 


$$-14_{\text{decimal}} = 110110_{-2} = (-1)(1)(-2)_4 = -1\cdot 4^2 + 1\cdot 4^1 - 2\cdot 4^0.$$


Similarly, by grouping into pairs the bits in the base - 1 + i 
representation of a complex integer, we obtain a base -2i 
representation for the complex integers using the digits 0, 1, - 1 + i, 
and i. This is a bit too complicated to be interesting. 


The "quater-imaginary" system [Knu2] is similar. It represents the 
complex integers using 2i as a base, and the digits 0, 1, 2, and 3 (with 
no sign). To represent some integers, namely those with an odd 
imaginary component, it is necessary to use a digit to the right of the 
radix point. For example, i is written 10.2 in base 2i. 


12-4 What Is the Most Efficient Base? 


Suppose you are building a computer and you are trying to decide 
what base to use to represent integers. For the registers you have 
available circuits that are 2-state (binary), 3-state, 4-state, and so on. 
Which should you use? 


Let us assume that the cost of a b-state circuit is proportional to b. 
Thus, a 3-state circuit costs 50% more than a binary circuit, a 4-state 
circuit costs twice as much as a binary circuit, and so on. 


Suppose you want the registers to be able to hold integers from 0 to 
some maximum M. Encoding integers from 0 to M in base b requires 


⌈log_b(M + 1)⌉ digits (e.g., to represent all integers from 0 to 
999,999 in decimal requires log_10(1,000,000) = 6 digits). 


One would expect the cost of a register to be equal to the product of 
the number of digits required times the cost to represent each digit: 


$$c = k \log_b(M + 1) \cdot b,$$


where c is the cost of a register and k is a constant of proportionality. 
For a given M, we wish to find b that minimizes the cost. 


The minimum of this function occurs for that value of b that makes 
dc/db = 0. Thus, we have 


$$\frac{d}{db}\bigl(k b \log_b(M+1)\bigr) = \frac{d}{db}\left(k b\,\frac{\ln(M+1)}{\ln b}\right) = k \ln(M+1)\,\frac{\ln b - 1}{(\ln b)^2}.$$


This is zero when ln b = 1, or b = e. 


This is not a very satisfactory result. Because e ≈ 2.718, 2 and 3 
must be the most efficient integral bases. Which is more efficient? The 
ratio of the cost of a base 2 register to the cost of a base 3 register is 


$$\frac{c(2)}{c(3)} = \frac{k \cdot 2\log_2(M+1)}{k \cdot 3\log_3(M+1)} = \frac{2\ln(M+1)/\ln 2}{3\ln(M+1)/\ln 3} = \frac{2\ln 3}{3\ln 2} \approx 1.056.$$


Thus, base 2 is more costly than base 3, but only by a small amount. 


By the same analysis, base 2 is more costly than base e by a factor of 
about 1.062. 
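
Spelled out, the comparison with base e follows the same steps as the comparison with base 3. This is a worked restatement under the same cost model, not an additional result:

$$\frac{c(2)}{c(e)} = \frac{2\ln(M+1)/\ln 2}{e\ln(M+1)/\ln e} = \frac{2}{e \ln 2} \approx 1.0615.$$

Because the cost is proportional to b/ln b for a fixed M, the candidate bases can be compared directly: b/ln b is about 2.885 for b = 2, 2.731 for b = 3, 2.718 for b = e, and 2.885 for b = 4. Under this model base 4 costs exactly the same as base 2, since 4/ln 4 = 4/(2 ln 2) = 2/ln 2.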


Chapter 13. Gray Code 


Gray Code 
Incrementing a Gray-Coded Integer 
Negabinary Gray Code 
Brief History and Applications 


13-1 Gray Code 


Is it possible to cycle through all 2^n combinations of n bits by 
changing only one bit at a time? The answer is "yes," and this is the 
defining property of Gray codes. That is, a Gray code is an encoding of 
the integers such that a Gray-coded integer and its successor differ in 
only one bit position. This concept can be generalized to apply to any 
base, such as decimal, but here we will discuss only binary Gray 
codes. 


Although there are many binary Gray codes, we will discuss only one: 
the "reflected binary Gray code." This code is what is usually meant in 
the literature by the unqualified term "Gray code." We will show, 
usually without proof, how to do some basic operations in this 
representation of integers, and we will show a few surprising 
properties. 


The reflected binary Gray code is constructed as follows. Start with 
the strings 0 and 1, representing the integers 0 and 1: 


0 
1 


Reflect this about a horizontal axis at the bottom of the list, and place 
a 1 to the left of the new list entries, and a 0 to the left of the original 
list entries: 


00 
01 
11 
10 


This is the reflected binary Gray code for n = 2. To get the code for n 
= 3, reflect this and attach a 0 or 1 as before: 


000 
001 
011 
010 
110 
111 
101 
100 


From this construction, it is easy to see by induction on n that (1) each 
of the 2^n bit combinations appears once and only once in the list, (2) 
only one bit changes in going from one list entry to the next, and (3) 
only one bit changes when cycling around from the last entry to the 
first. Gray codes having this last property are called "cyclic," and the 
reflected binary Gray code is necessarily cyclic. 


If n > 2, there are non-cyclic codes that take on all 2^n values once 
and only once. One such code is 000 001 011 010 110 100 101 111. 


Figure 13-1 shows, for n = 4, the integers encoded in ordinary binary 
and in Gray code. The formulas show how to convert from one 
representation to the other at the bit-by-bit level (as it would be done 
in hardware). 


Figure 13-1. 4-bit Gray code and conversion formulas. 


Binary   Gray 
 abcd    efgh 

 0000    0000      Gray from Binary     Binary from Gray 
 0001    0001      e = a                a = e 
 0010    0011      f = a ⊕ b            b = e ⊕ f 
 0011    0010      g = b ⊕ c            c = e ⊕ f ⊕ g 
 0100    0110      h = c ⊕ d            d = e ⊕ f ⊕ g ⊕ h 
 0101    0111 
 0110    0101 
 0111    0100 
 1000    1100 
 1001    1101 
 1010    1111 
 1011    1110 
 1100    1010 
 1101    1011 
 1110    1001 
 1111    1000 


As for the number of Gray codes on n bits, notice that one still has a 
cyclic binary Gray code after rotating the list (starting at any of the 2^n 
positions and cycling around) or reordering the columns. Any 
combination of these operations results in a distinct code. Therefore, 
there are at least 2^n · n! cyclic binary Gray codes on n bits. There are 
more than this for n ≥ 3. 


The Gray code and binary representations have the following dual 
relationships, evident from the formulas given in Figure 13-1: 


Bit i of a Gray-coded integer is the parity of bit i and the bit to 
the left of i in the corresponding binary integer (using 0 if 
there is no bit to the left of i). 


Bit i of a binary integer is the parity of all the bits at and to 
the left of position i in the corresponding Gray-coded integer. 


Converting to Gray from binary can be done in only two instructions: 


G ← B ⊕ (B >> 1). 


The conversion to binary from Gray is harder. One method is given by 


$$B \leftarrow \bigoplus_{i=0}^{n-1} (G \gg i).$$


We have already seen this formula in "Computing the Parity of a 
Word" on page 74. As mentioned there, this formula can be evaluated 
as illustrated below for n = 32. 


B = G ^ (G >> 1); 
B = B ^ (B >> 2); 
B = B ^ (B >> 4); 
B = B ^ (B >> 8); 
B = B ^ (B >> 16); 


Thus, in general it requires 2⌈log₂ n⌉ instructions. 
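
A minimal sketch of both conversions as C functions, assuming 32-bit unsigned words so that the right shifts are logical. The function names are illustrative.

```c
#include <stdint.h>

/* Reflected binary Gray code of the 32-bit integer b. */
uint32_t gray_from_binary(uint32_t b) {
    return b ^ (b >> 1);
}

/* Inverse: each binary bit is the parity of the Gray bits at and to
   the left of its position. */
uint32_t binary_from_gray(uint32_t g) {
    uint32_t b = g ^ (g >> 1);
    b ^= b >> 2;
    b ^= b >> 4;
    b ^= b >> 8;
    b ^= b >> 16;
    return b;
}
```

Checked against Figure 13-1, gray_from_binary(5) gives 0111 and binary_from_gray applied to 1000 gives 15.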


Because it is so easy to convert from binary to Gray, it is trivial to 
generate successive Gray-coded integers: 


for (i = 0; i < n; i++) { 
   G = i ^ (i >> 1); 
   output G; 
} 


13-2 Incrementing a Gray-Coded Integer 


The logic for incrementing a 4-bit binary integer abcd can be 
expressed as follows, using Boolean algebra notation: 


$$\begin{aligned} d' &= \bar{d} \\ c' &= c \oplus d \\ b' &= b \oplus cd \\ a' &= a \oplus bcd \end{aligned}$$


Thus, one way to build a Gray-coded counter in hardware is to build a 
binary counter using the above logic, and convert the outputs a', b', c', 
d' to Gray by forming the exclusive or of adjacent bits, as shown under 
"Gray from Binary" in Figure 13-1. 


A way that might be slightly better is described by the following 
formulas: 


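$$\begin{aligned} p &= e \oplus f \oplus g \oplus h \\ h' &= h \oplus \bar{p} \\ g' &= g \oplus hp \\ f' &= f \oplus g\bar{h}p \\ e' &= e \oplus f\bar{g}\bar{h}p \end{aligned}$$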


That is, the general case is 


$$G_n' = G_n \oplus (G_{n-1}\bar{G}_{n-2} \cdots \bar{G}_0\, p), \quad n \ge 2.$$


Because the parity p alternates between 0 and 1, a counter circuit 
might maintain p in a separate 1-bit register and simply invert it on 
each count. 


In software, the best way to find the successor G' of a Gray-coded 
integer G is probably simply to convert G to binary, increment the 
binary word, and convert it back to Gray code. Another way that's 
interesting and almost as good is to determine which bit to flip in G. 
The pattern goes like this, expressed as a word to be exclusive or'd to 
G: 


1 2 1 4 1 2 1 8 1 2 1 4 1 2 1 16 


The alert reader will recognize this as a mask that identifies the 
position of the leftmost bit that changes when incrementing the 
integer 0, 1, 2, 3, ..., corresponding to the positions in the above list. 
Thus, to increment a Gray-coded integer G, the bit position to invert is 
given by the leftmost bit that changes when 1 is added to the binary 
integer corresponding to G. 


This leads to the following algorithms for incrementing a Gray-coded 
integer G. They both first convert G to binary, which is shown as 
index(G). 


Figure 13-2 Incrementing a Gray-coded integer. 


B = index(G);              B = index(G); 
B = B + 1;                 M = ~B & (B + 1); 
Gp = B ^ (B >> 1);         Gp = G ^ M; 
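
The two columns of Figure 13-2 can be written out as complete C functions roughly as follows. The function names and the 32-bit word size are assumptions made for this sketch.

```c
#include <stdint.h>

/* index(G): convert a 32-bit Gray-coded word to binary. */
static uint32_t gray_index(uint32_t g) {
    g ^= g >> 1;
    g ^= g >> 2;
    g ^= g >> 4;
    g ^= g >> 8;
    g ^= g >> 16;
    return g;
}

/* Left-hand method: convert, increment, convert back. */
uint32_t gray_increment_convert(uint32_t g) {
    uint32_t b = gray_index(g) + 1;
    return b ^ (b >> 1);
}

/* Right-hand method: flip the one bit of G selected by the mask M. */
uint32_t gray_increment_mask(uint32_t g) {
    uint32_t b = gray_index(g);
    uint32_t m = ~b & (b + 1);   /* isolates the rightmost 0-bit of b */
    return g ^ m;
}
```

In this sketch the two functions agree for every index except the last. When the index is all 1's the mask is 0, so the second function leaves G unchanged, whereas the first wraps around to 0.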


A pencil-and-paper method of incrementing a Gray-coded integer is as 
follows: 


Starting from the right, find the first place at which the parity of bits 
at and to the left of the position is even. Invert the bit at this position. 


Or, equivalently: 
Let p be the parity of the word G. If p is even, invert the rightmost bit. 
If p is odd, invert the bit to the left of the rightmost 1-bit. 


The latter rule is directly expressed in the Boolean equations given 
above. 
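
The second form of the pencil-and-paper rule translates fairly directly into C. The sketch below uses assumed function names and makes no attempt to handle the wraparound from the last code word back to zero.

```c
#include <stdint.h>

/* Parity (XOR of all bits) of a 32-bit word: 1 if odd, 0 if even. */
static unsigned parity32(uint32_t w) {
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

/* Successor of the Gray-coded word g by the parity rule. */
uint32_t gray_successor_by_parity(uint32_t g) {
    if (parity32(g) == 0)
        return g ^ 1u;                    /* even: flip rightmost bit */
    uint32_t lowest_one = g & (0u - g);   /* isolate rightmost 1-bit */
    return g ^ (lowest_one << 1);         /* odd: flip bit to its left */
}
```

Starting from 0000 and applying the function repeatedly reproduces the Gray column of Figure 13-1: 0001, 0011, 0010, 0110, and so on.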


13-3 Negabinary Gray Code 


If you write the integers in order in base -2, and convert them using 
the "shift and exclusive or" that converts to Gray from straight binary, 
you get a Gray code. The 3-bit Gray code has indexes that range over 
the 3-bit base -2 numbers, namely -2 to 5. Similarly, the 4-bit Gray 
code corresponding to 4-bit base -2 numbers has indexes ranging from 
-10 to 5. It is not a reflected Gray code, but it almost is. The 4-bit Gray 
code can be generated by starting with 0 and 1, reflecting this about a 
horizontal axis at the top of the list, and then reflecting it about a 
horizontal axis at the bottom of the list, and so on. It is cyclic. 


To convert back to base -2 from this Gray code, the rules are of course 
the same as they are for converting to straight binary from ordinary 
reflected binary Gray code (because these operations are inverses, no 
matter what the interpretation of the bit strings is). 
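
As an illustration, the negabinary Gray code word for a given index might be computed as below, combining Schroeppel's conversion to base -2 with the usual shift and exclusive or. The function names are assumptions. The bit_distance helper is included only to make the one-bit-change property easy to check.

```c
#include <stdint.h>

/* Base -2 representation of n (Schroeppel's formula, 32-bit words). */
static uint32_t negabinary(int32_t n) {
    return ((uint32_t)n + 0xAAAAAAAAu) ^ 0xAAAAAAAAu;
}

/* Negabinary Gray code word whose index is n. */
uint32_t negabinary_gray(int32_t n) {
    uint32_t nb = negabinary(n);
    return nb ^ (nb >> 1);       /* same shift-and-xor as for binary */
}

/* Number of bit positions in which two words differ. */
unsigned bit_distance(uint32_t a, uint32_t b) {
    uint32_t d = a ^ b;
    unsigned count = 0;
    while (d != 0) {
        d &= d - 1;              /* clear the rightmost 1-bit */
        count++;
    }
    return count;
}
```

For the 4-bit code, the indexes -10 and 5 at the two ends of the range map to 1111 and 0111, which differ in one bit, in line with the code being cyclic.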


13-4 Brief History and Applications 


Gray codes are named after Frank Gray, a physicist at Bell Telephone 
Laboratories who in the 1930's invented the method we now use for 
broadcasting color TV in a way that's compatible with the black-and- 
white transmission and reception methods then in existence; that is, 
when the color signal is received by a black-and-white set, the picture 
appears in shades of gray. 


Martin Gardner [Gard] discusses applications of Gray codes involving 
the Chinese ring puzzle, the Tower of Hanoi puzzle, and Hamiltonian 
paths through graphs that represent hypercubes. He also shows how to 
convert from the decimal representation of an integer to a decimal 
Gray code representation. 


Gray codes are used in position sensors. A strip of material is made 
with conducting and nonconducting areas, corresponding to the 1's 
and 0's of a Gray-coded integer. Each column has a conducting wire 
brush positioned to read it out. If a brush is positioned on the dividing 
line between two of the quantized positions, so that its reading is 
ambiguous, then it doesn't matter which way the ambiguity is 
resolved. There can be only one ambiguous brush, and interpreting it 
as a 0 or 1 gives a position adjacent to the dividing line. 


The strip can instead be a series of concentric circular tracks, giving a 
rotational position sensor. For this application, the Gray code must be 
cyclic. Such a sensor is shown in Figure 13-3, where the four dots 
represent the brushes. 


Figure 13-3. Rotational position sensor. 


Chapter 14. Hilbert's Curve 


In 1890 Giuseppe Peano discovered a planar curve [1] with the rather 
surprising property that it is "space-filling." The curve winds around 
the unit square and hits every point (x, y) at least once. 


[1] Recall that a curve is a continuous map from a one-dimensional 
space to an n-dimensional space. 


Peano's curve is based on dividing each side of the unit square into 
three equal parts, which divides the square into nine smaller squares. 
His curve traverses these nine squares in a certain order. Then, each of 
the nine small squares is similarly divided into nine still smaller 
squares, and the curve is modified to traverse all these squares in a 
certain order. The curve can be described using fractions expressed in 
base 3; in fact, that's the way Peano first described it. 


In 1891 David Hilbert [Hil] discovered a variation of Peano's curve 
based on dividing each side of the unit square into two equal parts, 
which divides the square into four smaller squares. Then, each of the 
four small squares is similarly divided into four still smaller squares, 
and so on. For each stage of this division, Hilbert gives a curve that 
traverses all the squares. Hilbert's curve, sometimes called the "Peano- 
Hilbert curve," is the limit curve of this division process. It can be 
described using fractions expressed in base 2. 


Figure 14-1 shows the first three steps in the sequence that leads to 
Hilbert's space-filling curve, as they were depicted in his 1891 paper. 


Figure 14-1. First three curves in the sequence defining Hilbert's curve. 


Here, we do things a little differently. We use the term "Hilbert curve" 
for any of the curves in the sequence whose limit is the Hilbert 
space-filling curve. The "Hilbert curve of order n" means the nth curve in the 
sequence. In Figure 14-1, the curves are of order 1, 2, and 3. We shift 
the curves down and to the left so that the corners of the curves 
coincide with the intersections of the lines in the boxes above. Finally, 
we scale the size of the order n curve up by a factor of 2^n, so that the 
coordinates of the corners of the curves are integers. Thus, our order n 
Hilbert curve has corners at integers ranging from 0 to 2^n - 1 in both x 
and y. We take the positive direction along the curve to be from (x, y) 
= (0, 0) to (2^n - 1, 0). Figure 14-2 shows the "Hilbert curves," in our 
terminology, of orders 1 through 6. 


14-1 A Recursive Algorithm for Generating the Hilbert 
Curve 


To see how to generate a Hilbert curve, examine the curves in Figure 
14-2. The order 1 curve goes up, right, and down. The order 2 curve 
follows this overall pattern. First, it makes a U-shaped curve that goes 
up, in net effect. Second, it takes a unit step up. Third, it takes a U- 
shaped curve, a step, and another U, all to the right. Finally, it takes a 
step down, followed by a U that goes down, in net effect. 


Figure 14-2. Hilbert curves of orders 1-6. 


The order 1 inverted U is converted into the order 2 Y-shaped curve. 


We can regard the Hilbert curve of any order as a series of U-shaped 
curves of various orientations, each of which, except for the last, is 
followed by a unit step in a certain direction. In transforming a Hilbert 
curve of one order to the next, each U-shaped curve is transformed 
into a Y-shaped curve with the same general orientation, and each 
unit step is transformed to a unit step in the same direction. 


The transformation of the order 1 Hilbert curve (a U curve with a net 
direction to the right and a clockwise rotational orientation) to the 
order 2 Hilbert curve goes as follows: 


1. Draw a U that goes up and has a counterclockwise rotation. 
2. Draw a step up. 
3. Draw a U that goes to the right and has a clockwise rotation. 
4. Draw a step to the right. 
5. Draw a U that goes to the right and has a clockwise rotation. 
6. Draw a step down. 
7. Draw a U that goes down and has a counterclockwise rotation. 


We can see by inspection that all U's that are oriented as the order 1 
Hilbert curve are transformed in the same way. A similar set of rules 
can be made for transforming U's with other orientations. These rules 
are embodied in the recursive program shown in Figure 14-3 [Voor]. 
In this program, the orientation of a U curve is characterized by two 
integers that specify the net linear and the rotational directions, 
encoded as follows: 


dir = 0: right 
dir = 1: up 
dir = 2: left 
dir = 3: down 

rot = +1: clockwise 
rot = -1: counterclockwise 


Figure 14-3 Hilbert curve generator. 


void step(int); 

void hilbert(int dir, int rot, int order) { 

   if (order == 0) return; 

   dir = dir + rot; 
   hilbert(dir, -rot, order - 1); 
   step(dir); 
   dir = dir - rot; 
   hilbert(dir, rot, order - 1); 
   step(dir); 
   hilbert(dir, rot, order - 1); 
   dir = dir - rot; 
   step(dir); 
   hilbert(dir, -rot, order - 1); 
} 


Actually, dir can take on other values, but its congruency modulo 4 
is what matters. 
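
To experiment with the generator without printing, the steps can be recorded in arrays instead. The following is a self-contained sketch of that idea. The buffer names px and py, the fixed capacity, and the hilbert_trace wrapper are assumptions introduced here and are not part of the program in the figures.

```c
#include <stddef.h>

/* Hypothetical trace buffer; capacity is enough for order <= 5. */
#define MAX_POINTS 1024
int px[MAX_POINTS], py[MAX_POINTS];
size_t npoints;
int cur_x, cur_y;

void step(int dir) {
    switch (dir & 3) {            /* only dir modulo 4 matters */
        case 0: cur_x++; break;   /* right */
        case 1: cur_y++; break;   /* up */
        case 2: cur_x--; break;   /* left */
        case 3: cur_y--; break;   /* down */
    }
    px[npoints] = cur_x;          /* record the point reached */
    py[npoints] = cur_y;
    npoints++;
}

void hilbert(int dir, int rot, int order) {
    if (order == 0) return;
    dir += rot;
    hilbert(dir, -rot, order - 1);
    step(dir);
    dir -= rot;
    hilbert(dir, rot, order - 1);
    step(dir);
    hilbert(dir, rot, order - 1);
    dir -= rot;
    step(dir);
    hilbert(dir, -rot, order - 1);
}

/* Traces the order-n curve from (0, 0); returns the number of points. */
size_t hilbert_trace(int order) {
    cur_x = cur_y = 0;
    px[0] = 0;                    /* the starting point is point 0 */
    py[0] = 0;
    npoints = 1;
    hilbert(0, 1, order);
    return npoints;
}
```

After hilbert_trace(2), the arrays hold the 16 corners of the order 2 curve in order, ending at (3, 0).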


Figure 14-4 shows a driver program and function step that is used 
by program hilbert. This program is given the order of a Hilbert 
curve to construct, and it displays a list of line segments, giving for 
each the direction of movement, the length along the curve to the end 
of the segment, and the coordinates of the end of the segment. For 
example, for order 2 it displays 


    0   0000   00 00 
    0   0001   01 00 
    1   0010   01 01 
    2   0011   00 01 
    1   0100   00 10 
    1   0101   00 11 
    0   0110   01 11 
   -1   0111   01 10 
    0   1000   10 10 
    1   1001   10 11 
    0   1010   11 11 
   -1   1011   11 10 
   -1   1100   11 01 
   -2   1101   10 01 
   -1   1110   10 00 
    0   1111   11 00 


Figure 14-4 Driver program for Hilbert curve generator. 


#include <stdio.h> 
#include <stdlib.h> 

int x = -1, y = 0;                  // Global variables. 
int i = 0;                          // Dist. along curve. 
int blen;                           // Length to print. 

void hilbert(int dir, int rot, int order); 

void binary(unsigned k, int len, char *s) { 
/* Converts the unsigned integer k to binary character 
form. Result is string s of length len. */ 

   int i; 

   s[len] = 0; 
   for (i = len - 1; i >= 0; i--) { 
      if (k & 1) s[i] = '1'; 
      else       s[i] = '0'; 
      k = k >> 1; 
   } 
} 

void step(int dir) { 
   char ii[33], xx[17], yy[17]; 

   switch(dir & 3) { 
      case 0: x = x + 1; break; 
      case 1: y = y + 1; break; 
      case 2: x = x - 1; break; 
      case 3: y = y - 1; break; 
   } 
   binary(i, 2*blen, ii); 
   binary(x, blen, xx); 
   binary(y, blen, yy); 
   printf("%5d   %s   %s %s\n", dir, ii, xx, yy); 
   i = i + 1;                       // Increment distance. 
} 

int main(int argc, char *argv[]) { 
   int order; 

   order = atoi(argv[1]); 
   blen = order; 
   step(0);                         // Print init. point. 
   hilbert(0, 1, order); 
   return 0; 
} 


14-2 Coordinates from Distance along the Hilbert Curve 


To find the (x, y) coordinates of a point located at a distance s along 
the order n Hilbert curve, observe that the most significant two bits of 
the 2n-bit integer s determine which major quadrant the point is in. 
This is because the Hilbert curve of any order follows the overall 
pattern of the order 1 curve. If the most significant two bits of s are 
00, the point is somewhere in the lower left quadrant, if 01 it is in the 
upper left quadrant, if 10 it is in the upper right quadrant, and if 11 it 
is in the lower right quadrant. Thus, the most significant two bits of s 
determine the most significant bits of the n-bit integers x and y, as 
follows: 


Most significant      Most significant 
two bits of s         bits of (x, y) 
      00                 (0, 0) 
      01                 (0, 1) 
      10                 (1, 1) 
      11                 (1, 0) 


In any Hilbert curve, only four of the eight possible U-shapes occur. 
These are shown in Table 14-1 as graphics and as maps from two bits 
of s to a single bit of each of x and y. 


Observe from Figure 14-2 that in all cases, the U-shape represented by 
map A becomes, at the next level of detail, a U-shape 
represented by maps B, A, A, or D, depending on whether the length 
traversed in the first-mentioned map A is 0, 1, 2, or 3, respectively. 

Similarly, a U-shape represented by map B becomes, at the 
next level of detail, a U-shape represented by maps A, B, B, or C, 
depending on whether the length traversed in the first-mentioned map 
B is 0, 1, 2, or 3, respectively. 


These observations lead to the state transition table shown in Table 
14-2, in which the states correspond to the mappings shown in Table 
14-1. 


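Table 14-1. The four possible mappings 

      A               B               C               D 
00 → (0, 0)     00 → (0, 0)     00 → (1, 1)     00 → (1, 1) 
01 → (0, 1)     01 → (1, 0)     01 → (1, 0)     01 → (0, 1) 
10 → (1, 1)     10 → (1, 1)     10 → (0, 0)     10 → (0, 0) 
11 → (1, 0)     11 → (0, 1)     11 → (0, 1)     11 → (1, 0) 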


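Table 14-2. State transition table for computing (x, y) from s 

If the current   and the next (to right)   then append   and enter 
state is         two bits of s are         to (x, y)     state 
A                00                        (0, 0)        B 
A                01                        (0, 1)        A 
A                10                        (1, 1)        A 
A                11                        (1, 0)        D 
B                00                        (0, 0)        A 
B                01                        (1, 0)        B 
B                10                        (1, 1)        B 
B                11                        (0, 1)        C 
C                00                        (1, 1)        D 
C                01                        (1, 0)        C 
C                10                        (0, 0)        C 
C                11                        (0, 1)        B 
D                00                        (1, 1)        C 
D                01                        (0, 1)        D 
D                10                        (0, 0)        D 
D                11                        (1, 0)        A 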


To use the table, start in state A. The integer s should be padded with 
leading zeros so that its length is 2n, where n is the order of the 
Hilbert curve. Scan the bits of s in pairs from left to right. The first 
row of Table 14-2 means that if the current state is A and the 
currently scanned bits of s are 00, then output (0, 0) and enter state B. 
Then, advance to the next two bits of s. Similarly, the second row 
means that if the current state is A and the scanned bits are 01, then 
output (0, 1) and stay in state A. 


The output bits are accumulated in left-to-right order. When the end 
of s is reached, the n-bit output quantities x and y are defined. 


As an example, suppose n = 3 and 


s = 110100. 


Because the process starts in state A and the initial bits scanned are 
11, the process outputs (1, 0) and enters state D (fourth row). Then, in 
state D and scanning 01, the process outputs (0, 1) and stays in state 
D. Lastly, the process outputs (1, 1) and enters state C, although the 
state is now immaterial. 


Thus, the output is (101, 011)—that is, x = 5 and y = 3. 


A C program implementing these steps is shown in Figure 14-5. In this 
program, the current state is represented by an integer from 0 to 3 for 
states A through D, respectively. In the assignment to variable row, 
the current state is concatenated with the next two bits of s, giving 
an integer from 0 to 15, which is the applicable row number in Table 
14-2. Variable row is used to access integers (expressed in 
hexadecimal) that are used as bit strings to represent the rightmost 
two columns of Table 14-2; that is, these accesses are in-register table 
lookups. Left-to-right in the hexadecimal values corresponds to 
bottom-to-top in Table 14-2. 


Figure 14-5 Program for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   int i; 
   unsigned state, x, y, row; 

   state = 0;                            // Initialize. 
   x = y = 0; 

   for (i = 2*n - 2; i >= 0; i -= 2) {   // Do n times. 
      row = 4*state | (s >> i) & 3;      // Row in table. 
      x = (x << 1) | (0x936C >> row) & 1; 
      y = (y << 1) | (0x39C6 >> row) & 1; 
      state = (0x3E6B94C1 >> 2*row) & 3; // New state. 
   } 
   *xp = x;                              // Pass back 
   *yp = y;                              // results. 
} 
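
For comparison, here is a sketch of the same state machine with the columns of Table 14-2 written out as ordinary arrays rather than packed into hexadecimal constants. The array names and the function name are illustrative assumptions. The entries were obtained by unpacking the three constants used in Figure 14-5.

```c
/* Table 14-2 as arrays, indexed by 4*state + (two bits of s).
   States A..D are numbered 0..3. */
static const unsigned char x_out[16] =
    {0,0,1,1,  0,1,1,0,  1,1,0,0,  1,0,0,1};
static const unsigned char y_out[16] =
    {0,1,1,0,  0,0,1,1,  1,0,0,1,  1,1,0,0};
static const unsigned char next_state[16] =
    {1,0,0,3,  0,1,1,2,  3,2,2,1,  2,3,3,0};

void hil_xy_from_s_table(unsigned s, int n,
                         unsigned *xp, unsigned *yp) {
    unsigned state = 0, x = 0, y = 0;
    for (int i = 2*n - 2; i >= 0; i -= 2) {      /* left to right */
        unsigned row = 4*state + ((s >> i) & 3); /* row of the table */
        x = (x << 1) | x_out[row];               /* append output bits */
        y = (y << 1) | y_out[row];
        state = next_state[row];
    }
    *xp = x;
    *yp = y;
}
```

With s = 110100 (binary) and n = 3 this produces x = 5 and y = 3, matching the worked example. The array form costs memory accesses that the in-register lookups avoid, but it makes the correspondence with the table easier to see.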


[L&S] give a quite different algorithm. Unlike the algorithm of Figure 
14-5, it scans the bits of s from right to left. It is based on the 
observation that one can map the least significant two bits of s to (x, 
y) based on the order 1 Hilbert curve, and then test the next two bits 
of s to the left. If they are 00, the values of x and y just computed 
should be interchanged, which corresponds to reflecting the order 1 
Hilbert curve about the line x = y. (Refer to the curves of orders 1 
and 2 shown in Figure 14-1 on page 241.) If these two bits are 01 or 
10, the values of x and y are not changed. If they are 11, the values of 
x and y are interchanged and complemented. These same rules apply 
as one progresses leftward along the bits of s. They are embodied in 
Table 14-3 and the code of Figure 14-6. It is somewhat curious that 
the bits can be prepended to x and y first, and then the swap and 
complement operations can be done, including these newly prepended 
bits; the results are the same. 


Figure 14-6 Lam and Shapiro method for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp, 
                                      unsigned *yp) { 

   int i, sa, sb; 
   unsigned x, y, temp; 

   for (i = 0; i < 2*n; i += 2) { 
      sa = (s >> (i+1)) & 1;      // Get bit i+1 of s. 
      sb = (s >> i) & 1;          // Get bit i of s. 

      if ((sa ^ sb) == 0) {       // If sa,sb = 00 or 11, 
         temp = x;                // swap x and y, 
         x = y^(-sa);             // and if sa = 1, 
         y = temp^(-sa);          // complement them. 
      } 
      x = (x >> 1) | (sa << 31);  // Prepend sa to x and 
      y = (y >> 1) | ((sa ^ sb) << 31); // (sa^sb) to y. 
   } 
   *xp = x >> (32 - n);           // Right-adjust x and y 
   *yp = y >> (32 - n);           // and return them to 
}                                 // the caller. 


In Figure 14-6, variables x and y are uninitialized, which might 
cause an error message from some compilers. But the code functions 
correctly for whatever values x and y have initially. 


The branch in the loop of Figure 14-6 can be avoided by doing the 
swap operation with the "three exclusive or" trick given in Section 
2-19 on page 38. 


Table 14-3. Lam and Shapiro method for computing (X, Y) from S

If the next two bits
to the left of s are    then                             and prepend to (x, y)
00                      Swap x and y                     (0, 0)
01                      No change                        (0, 1)
10                      No change                        (1, 1)
11                      Swap and complement x and y      (1, 0)


The if block can be replaced by the following code, where swap and
cmpl are unsigned integers:


   swap = (sa ^ sb) - 1;  // -1 if should swap, else 0.
   cmpl = -(sa & sb);     // -1 if should compl't, else 0.
   x = x ^ y;
   y = y ^ (x & swap) ^ cmpl;
   x = x ^ y;


However, this is nine instructions, versus about two or six for the if
block, so the branch cost would have to be quite high for this to be a
good choice.
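
As a quick check that the branch-free form does the same thing as the if block, the sketch below wraps both versions in small functions and compares them. The function names are hypothetical, and sa and sb are taken as unsigned here so that the negation is well defined; this is an illustration, not code from [L&S].

```c
/* Swap-and-complement step of Figure 14-6, written with a branch. */
void step_branch(unsigned sa, unsigned sb, unsigned *x, unsigned *y) {
    if ((sa ^ sb) == 0) {              /* sa,sb = 00 or 11: swap,     */
        unsigned temp = *x;            /* and complement if sa = 1.   */
        *x = *y ^ (0u - sa);
        *y = temp ^ (0u - sa);
    }
}

/* The same step without a branch, using the three-XOR swap. */
void step_branchfree(unsigned sa, unsigned sb, unsigned *x, unsigned *y) {
    unsigned swap = (sa ^ sb) - 1;     /* All 1's if should swap, else 0. */
    unsigned cmpl = 0u - (sa & sb);    /* All 1's if should complement.   */
    *x = *x ^ *y;
    *y = *y ^ (*x & swap) ^ cmpl;
    *x = *x ^ *y;
}

/* Hypothetical check: 1 if both versions give the same (x, y). */
int steps_agree(unsigned sa, unsigned sb, unsigned x, unsigned y) {
    unsigned x1 = x, y1 = y, x2 = x, y2 = y;
    step_branch(sa, sb, &x1, &y1);
    step_branchfree(sa, sb, &x2, &y2);
    return x1 == x2 && y1 == y2;
}
```

Calling steps_agree for the four combinations of sa and sb covers all cases, because the step treats every bit position of x and y alike.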


The "swap and complement" idea of [L&S] suggests a logic circuit for 
generating the Hilbert curve. The idea behind the circuit described 
below is that as you trace along the path of an order n curve, you 
basically map pairs of bits of s to (x, y) according to map A of Table 
14-1. As the trace enters various regions, however, the mapping 
output gets swapped, complemented, or both. The circuit of Figure 
14-7 keeps track of the swap and complement requirements of each 
stage, uses the appropriate mapping to map two bits of s to (xi, yj), 
and generates the swap and complement signals for the next stage. 


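Figure 14-7. Logic circuit for computing (x, y) from s.

Each stage i of the circuit implements the following equations:

$$x_i = [s_{2i+1} \oplus (s_{2i} S_{i+1})] \oplus C_{i+1}$$
$$y_i = [(s_{2i} \oplus s_{2i+1}) \oplus (s_{2i} S_{i+1})] \oplus C_{i+1}$$
$$S_i = S_{i+1} \oplus (s_{2i} \equiv s_{2i+1})$$
$$C_i = C_{i+1} \oplus (s_{2i} s_{2i+1})$$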


Assume there is a register containing the path length s and circuits for 
incrementing it. Then, to find the next point on the Hilbert curve, first 
increment s and then transform it as described in Table 14-4. This is a 
left-to-right process, which is a bit of a problem because incrementing 
s is a right-to-left process. Thus, the time to generate a new point on 
an order n Hilbert curve is proportional to 2n (for incrementing s) 
plus n (for transforming s to (x, y)). 


Figure 14-7 shows this computation as a logic circuit. In this figure, S 
denotes the swap signal and C denotes the complement signal. 


Table 14-4. Logic for computing (X, Y) from S

If the next two bits
to the right of s are   then                             and append to (x, y)
00                      swap = ~swap                     (0, 0) [*]
01                      No change                        (0, 1) [*]
10                      No change                        (1, 1) [*]
11                      swap = ~swap, cmpl = ~cmpl       (1, 0) [*]

[*] Possibly swapped and/or complemented


The logic circuit of Figure 14-7 suggests another way to compute (x, 
y) from s. Notice how the swap and complement signals propagate 
from left to right through the n stages. This suggests that it may be 
possible to use the parallel prefix operation to quickly (in $\log_2 n$ steps
rather than n - 1) propagate the swap and complement information to 
each stage, and then do some word-parallel logical operations to 
compute x and y, using the equations in Figure 14-7. The values of x 
and y are intermingled in the even and odd bit positions of a word, so 
they have to be separated by the unshuffle operation (see page 107). 
This might seem a bit complicated, and likely to pay off only for 
rather large values of n, but let us see how it goes. 


A procedure for this operation is shown in Figure 14-8 [GLS1]. The 
procedure operates on fullword quantities, so it first pads the input s 
on the left with '01' bits. This bit combination does not affect the swap 
and complement quantities. Next, a quantity cs (complement-swap) 
is computed. This word is of the form cscs...cs, where each c (a 
single bit), if 1, means that the corresponding pair of bits is to be 
complemented, and each s means that the corresponding pair of bits 
is to be swapped, following Table 14-3. In other words, these two 
statements map each pair of bits of s as follows: 


   s_{2i+1} s_{2i}     cs
         00            01
         01            00
         10            00
         11            11

Figure 14-8 Parallel prefix method for computing (x, y) from s. 


void hil_xy_from_s(unsigned s, int n, unsigned *xp,
                                      unsigned *yp) {

   unsigned comp, swap, cs, t, sr;

   s = s | (0x55555555 << 2*n); // Pad s on left with 01
   sr = (s >> 1) & 0x55555555;  // (no change) groups.
   cs = ((s & 0x55555555) + sr) // Compute complement &
        ^ 0x55555555;           // swap info in two-bit
                                // groups.
   // Parallel prefix xor op to propagate both complement
   // and swap info together from left to right (there is
   // no step "cs ^= cs >> 1", so in effect it computes
   // two independent parallel prefix operations on two
   // interleaved sets of sixteen bits).

   cs = cs ^ (cs >> 2);
   cs = cs ^ (cs >> 4);
   cs = cs ^ (cs >> 8);
   cs = cs ^ (cs >> 16);
   swap = cs & 0x55555555;        // Separate the swap and
   comp = (cs >> 1) & 0x55555555; // complement bits.

   t = (s & swap) ^ comp;       // Calculate x and y in
   s = s ^ sr ^ t ^ (t << 1);   // the odd & even bit
                                // positions, resp.
   s = s & ((1 << 2*n) - 1);    // Clear out any junk
                                // on the left (unpad).

   // Now "unshuffle" to separate the x and y bits.

   t = (s ^ (s >> 1)) & 0x22222222; s = s ^ t ^ (t << 1);
   t = (s ^ (s >> 2)) & 0x0C0C0C0C; s = s ^ t ^ (t << 2);
   t = (s ^ (s >> 4)) & 0x00F000F0; s = s ^ t ^ (t << 4);
   t = (s ^ (s >> 8)) & 0x0000FF00; s = s ^ t ^ (t << 8);

   *xp = s >> 16;               // Assign the two halves
   *yp = s & 0xFFFF;            // of s to x and y.
}


This is the quantity to which we want to apply the parallel prefix 
operation. PP-XOR is the one to use, going from left to right, because 
successive 1-bits meaning to complement or to swap have the same 
logical properties as exclusive or: Two successive 1-bits cancel each 
other. 


Both signals (complement and swap) are propagated in the same PP- 
XOR operation, each working with every other bit of cs. 
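
To make the PP-XOR operation concrete, here is a minimal sketch of the ordinary (non-interleaved) left-to-right parallel prefix exclusive or on a 32-bit word, together with a bit-at-a-time reference version. The function names are hypothetical. Figure 14-8 uses the same idea but omits the shift by 1, so that the even and odd bit positions are processed independently.

```c
#include <stdint.h>

/* Parallel prefix XOR, left to right: bit i of the result is the
   XOR of bits 31..i of x. Five steps for a 32-bit word. */
uint32_t pp_xor_left(uint32_t x) {
    x ^= x >> 1;
    x ^= x >> 2;
    x ^= x >> 4;
    x ^= x >> 8;
    x ^= x >> 16;
    return x;
}

/* Reference version: one bit per iteration, scanning from the left. */
uint32_t pp_xor_left_naive(uint32_t x) {
    uint32_t r = 0, acc = 0;
    for (int i = 31; i >= 0; i--) {
        acc ^= (x >> i) & 1;          /* Running XOR of bits 31..i. */
        r |= acc << i;
    }
    return r;
}
```

Two successive 1-bits cancel, as the text says: a word with 1-bits in positions 7 and 4 gives 1-bits only in positions 7, 6, and 5 of the result.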


The next four assignment statements have the effect of translating 
each pair of bits of s into (x, y) values, with x being in the odd 
(leftmost) bit positions, and y being in the even bit positions. 
Although the logic may seem obscure, it is not difficult to verify that 
each pair of bits of s is transformed by the logic of the first two 
Boolean equations in Figure 14-7. (Suggestion: Consider separately 
how the even and odd bit positions are transformed, using the fact 
that t and sr are 0 in the odd positions.) 


The rest of the procedure is self-explanatory. It executes in 66 basic
RISC instructions (constant, branch-free), versus about 19n + 10
(average) for the code of Figure 14-6 (based on compiled code;
includes prologs and epilogs, which are essentially nil). Thus, the
parallel prefix method is faster for $n \geq 3$.
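
Whichever conversion routine is used, a simple sanity check is that consecutive values of s map to points one unit step apart. The sketch below does this for the table-lookup routine of Figure 14-5; the helper hil_all_unit_steps is a hypothetical name introduced here, and the check is meant for small orders (n up to 15).

```c
#include <stdlib.h>

/* Table-lookup conversion from s to (x, y), as in Figure 14-5. */
void hil_xy_from_s(unsigned s, int n, unsigned *xp, unsigned *yp) {
    unsigned state = 0, x = 0, y = 0, row;
    for (int i = 2*n - 2; i >= 0; i -= 2) {
        row = 4*state | ((s >> i) & 3);          /* Row in table. */
        x = (x << 1) | ((0x936C >> row) & 1);
        y = (y << 1) | ((0x39C6 >> row) & 1);
        state = (0x3E6B94C1 >> 2*row) & 3;       /* New state.    */
    }
    *xp = x;
    *yp = y;
}

/* Hypothetical helper: returns 1 if every pair of consecutive
   points on the order-n curve differs by exactly one unit step. */
int hil_all_unit_steps(int n) {
    unsigned x0, y0, x1, y1;
    unsigned last = (1u << 2*n) - 1;             /* Largest s. */
    hil_xy_from_s(0, n, &x0, &y0);
    for (unsigned s = 1; s <= last; s++) {
        hil_xy_from_s(s, n, &x1, &y1);
        int dx = abs((int)x1 - (int)x0);
        int dy = abs((int)y1 - (int)y0);
        if (dx + dy != 1) return 0;              /* Not a unit step. */
        x0 = x1;
        y0 = y1;
    }
    return 1;
}
```

The same loop can be pointed at the Lam and Shapiro or parallel prefix versions to compare them against the table-lookup routine.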


14-3 Distance from Coordinates on the Hilbert Curve 


Given the coordinates of a point on the Hilbert curve, the distance 
from the origin to the point can be calculated by means of a state 
transition table similar to Table 14-2. Table 14-5 is such a table. 


Its interpretation is similar to that of the previous section. First, x and 
y should be padded with leading zeros so that they are of length n 
bits, where n is the order of the Hilbert curve. Second, the bits of x 
and y are scanned from left to right, and s is built up from left to 
right. 


A C program implementing these steps is shown in Figure 14-9.


Figure 14-9 Program for computing s from (x, y).


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {

   int i;
   unsigned state, s, row;

   state = 0;                            // Initialize.
   s = 0;

   for (i = n - 1; i >= 0; i--) {
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      s = (s << 2) | (0x361E9CB4 >> 2*row) & 3;
      state = (0x8FE65831 >> 2*row) & 3;
   }
   return s;
}


[L&S] give an algorithm for computing s from (x, y) that is similar to 
their algorithm for going in the other direction (Table 14-3). It is a
left-to-right algorithm, shown in Table 14-6 and Figure 14-10. 


Figure 14-10 Lam and Shapiro method for computing s from (x, y). 


unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {

   int i, xi, yi;
   unsigned s, temp;

   s = 0;                         // Initialize.
   for (i = n - 1; i >= 0; i--) {
      xi = (x >> i) & 1;          // Get bit i of x.
      yi = (y >> i) & 1;          // Get bit i of y.

      if (yi == 0) {
         temp = x;                // Swap x and y and,
         x = y^(-xi);             // if xi = 1,
         y = temp^(-xi);          // complement them.
      }
      s = 4*s + 2*xi + (xi^yi);   // Append two bits to s.
   }
   return s;
}


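Table 14-5. State transition table for computing S from (X, Y)

If the current   and the next (to the right)   then append   and enter
state is         two bits of (x, y) are        to s          state
A                (0, 0)                        00            B
A                (0, 1)                        01            A
A                (1, 0)                        11            D
A                (1, 1)                        10            A
B                (0, 0)                        00            A
B                (0, 1)                        11            C
B                (1, 0)                        01            B
B                (1, 1)                        10            B
C                (0, 0)                        10            C
C                (0, 1)                        11            B
C                (1, 0)                        01            C
C                (1, 1)                        00            D
D                (0, 0)                        10            D
D                (0, 1)                        01            D
D                (1, 0)                        11            A
D                (1, 1)                        00            C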


Table 14-6. Lam and Shapiro method for computing S from (X, Y)

If the next (to the right)
two bits of (x, y) are      then                             and append to s
(0, 0)                      Swap x and y                     00
(0, 1)                      No change                        01
(1, 0)                      Swap and complement x and y      11
(1, 1)                      No change                        10


14-4 Incrementing the Coordinates on the Hilbert Curve 


Given the (x, y) coordinates of a point on the order n Hilbert curve, 
how can one find the coordinates of the next point? One way is to 
convert (x, y) to s, add 1 to s, and then convert the new value of s 
back to (x, y), using algorithms given above. 
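
The convert, increment, convert-back approach can be sketched as follows, using the table-lookup routines of Figures 14-9 and 14-5. The helper name hil_next_via_s is hypothetical, and the wraparound from the last point to the first is a choice made for this illustration.

```c
/* s from (x, y), table lookup as in Figure 14-9. */
unsigned hil_s_from_xy(unsigned x, unsigned y, int n) {
    unsigned state = 0, s = 0, row;
    for (int i = n - 1; i >= 0; i--) {
        row = 4*state | 2*((x >> i) & 1) | ((y >> i) & 1);
        s = (s << 2) | ((0x361E9CB4 >> 2*row) & 3);
        state = (0x8FE65831 >> 2*row) & 3;
    }
    return s;
}

/* (x, y) from s, table lookup as in Figure 14-5. */
void hil_xy_from_s(unsigned s, int n, unsigned *xp, unsigned *yp) {
    unsigned state = 0, x = 0, y = 0, row;
    for (int i = 2*n - 2; i >= 0; i -= 2) {
        row = 4*state | ((s >> i) & 3);
        x = (x << 1) | ((0x936C >> row) & 1);
        y = (y << 1) | ((0x39C6 >> row) & 1);
        state = (0x3E6B94C1 >> 2*row) & 3;
    }
    *xp = x;
    *yp = y;
}

/* Hypothetical helper: advance (x, y) one step along the order-n
   curve (n <= 15) by going through s. Wraps from the last point
   back to the first. */
void hil_next_via_s(unsigned *xp, unsigned *yp, int n) {
    unsigned s = hil_s_from_xy(*xp, *yp, n);
    s = (s + 1) & ((1u << 2*n) - 1);
    hil_xy_from_s(s, n, xp, yp);
}
```

Each step costs two passes over the n bit pairs, which is the baseline that the direct stepping method described next tries to improve on.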


A slightly (but not dramatically) better way is based on the fact that as 
one moves along the Hilbert curve, at each step either x or y, but not 
both, is either incremented or decremented (by 1). The algorithm to 
be described scans the coordinate numbers from left to right to 
determine the type of U-curve that the rightmost two bits are on. 
Then, based on the U-curve and the value of the rightmost two bits, it 
increments or decrements either x or y. 


That's basically it, but there is a complication when the path is at the 
end of a U-curve (which happens once every four steps). At this point, 
the direction to take is determined by the previous bits of x and y and 
by the higher order U-curve with which these bits are associated. If 
that point is also at the end of its U-curve, then the previous bits and 
the U-curve there determine the direction to take, and so on. 


Table 14-7 describes this algorithm. In this table, the A, B, C, and D 
denote the U-curves as shown in Table 14-1 on page 246. To use the 
table, first pad x and y with leading zeros so they are n bits long, 
where n is the order of the Hilbert curve. Start in state A and scan the 
bits of x and y from left to right. The first row of Table 14-7 means 
that if the current state is A and the currently scanned bits are (0, 0), 
then set a variable to indicate to increment y, and enter state B. The 
other rows are interpreted similarly, with a suffix minus sign 
indicating to decrement the associated coordinate. A dash in the third 
column means do not alter the variable that keeps track of the 
coordinate changes. 


After scanning the last (rightmost) bits of x and y, increment or 
decrement the appropriate coordinate as indicated by the final value 
of the variable. 


A C program implementing these steps is shown in Figure 14-11. 
Variable dx is initialized in such a way that if invoked many times, 
the algorithm cycles around, generating the same Hilbert curve over 
and over again. (However, the step that connects one cycle to the next 
is not a unit step.) 


Figure 14-11 Program for taking one step on the Hilbert curve. 


void hil_inc_xy(unsigned *xp, unsigned *yp, int n) {

   int i;
   unsigned x, y, state, dx, dy, row, dochange;

   x = *xp;
   y = *yp;
   state = 0;                   // Initialize.
   dx = -((1 << n) - 1);        // Init. -(2**n - 1).
   dy = 0;

   for (i = n-1; i >= 0; i--) {         // Do n times.
      row = 4*state | 2*((x >> i) & 1) | (y >> i) & 1;
      dochange = (0xBDDB >> row) & 1;
      if (dochange) {
         dx = ((0x16451659 >> 2*row) & 3) - 1;
         dy = ((0x51166516 >> 2*row) & 3) - 1;
      }
      state = (0x8FE65831 >> 2*row) & 3;
   }
   *xp = *xp + dx;
   *yp = *yp + dy;
}


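Table 14-7. Taking one step on the Hilbert curve

If the current   and the next (to the right)   then prepare   and enter
state is         two bits of (x, y) are        to inc/dec     state
A                (0, 0)                        y+             B
A                (0, 1)                        x+             A
A                (1, 0)                        -              D
A                (1, 1)                        y-             A
B                (0, 0)                        x+             A
B                (0, 1)                        -              C
B                (1, 0)                        y+             B
B                (1, 1)                        x-             B
C                (0, 0)                        y+             C
C                (0, 1)                        -              B
C                (1, 0)                        x-             C
C                (1, 1)                        y-             D
D                (0, 0)                        x+             D
D                (0, 1)                        y-             D
D                (1, 0)                        -              A
D                (1, 1)                        x-             C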


Table 14-7 can readily be implemented in logic, as shown in Figure 
14-12. In this figure, the variables have the following meanings: 


Figure 14-12. Logic circuit for incrementing (x, y) by one step along the
Hilbert curve.

x_i     Bit i of input x.
y_i     Bit i of input y.
X, Y    x_i and y_i swapped and complemented, according to S_{i+1} and C_{i+1}.
I       If 1, increment; if 0, decrement (by 1).
W       If 1, increment or decrement x; if 0, increment or decrement y.
S       If 1, swap x_i and y_i.
C       If 1, complement x_i and y_i.


S and C together identify the "state" of Table 14-7, with (C, S) = (0, 
0), (0, 1), (1, 0), and (1, 1) denoting states A, B, C, and D, 
respectively. The output signals are I_0 and W_0, which tell,
respectively, whether to increment or decrement, and which variable
to change. (In addition to the logic shown, an incrementer/decrementer
circuit is required, with MUX's to route either x or y to
the incrementer/decrementer, and a circuit to route the altered value
back to the register that holds x or y. Alternatively, two
incrementer/decrementer circuits could be used.)


14-5 Non-recursive Generating Algorithms 


The algorithms of Tables 14-2 and 14-7 provide two non-recursive 
algorithms for generating the Hilbert curve of any order. Either 
algorithm can be implemented in hardware without great difficulty. 
Hardware based on Table 14-2 includes a register holding s, which it 
increments for each step, and then converts to (x, y) coordinates. 
Hardware based on Table 14-7 would not have to include a register 
for s, but the algorithm is more complicated. 


14-6 Other Space-Filling Curves 


As was mentioned, Peano was first, in 1890, to discover a space-filling 
curve. The many variations discovered since then are often called 
"Peano curves." One interesting variation of Hilbert's curve was 
discovered by Eliakim Hastings Moore in 1900. It is "cyclic" in the 
sense that the end point is one step away from the starting point. The 
Peano curve of order 3, and the Moore curve of order 4, are shown in 
Figure 14-13. Moore's curve has an irregularity in that the order 1
curve is up-right-down (⊓), but this shape does not appear in the
higher-order curves. Except for this minor exception, the algorithms
for dealing with Moore's curve are very similar to those for the Hilbert
curve.


Figure 14-13. Peano (left) and Moore (right) curves. 


The Hilbert curve has been generalized to arbitrary rectangles and to 
three and higher dimensions. The basic building block for a
3-dimensional Hilbert curve is shown below. It hits all eight points of a
2x2x2 cube. These and many other space-filling curves are discussed 
in [Sagan]. 


14-7 Applications 


Space-filling curves have applications in image processing: 
compression, halftoning, and textural analysis [L&S]. Another 
application is to improve computer performance in ray tracing, a 
graphics-rendering technique. Conventionally, a scene is scanned by 
projecting rays across the scene in ordinary raster scan line order (left 
to right across the screen, and then top to bottom). When a ray hits an 
object in the simulated scene's database, the color and other properties 
of the object at that point are determined and the results are used to 
illuminate the pixel through which the ray was sent. (This is an 
oversimplification, but it's adequate for our purposes.) One problem is 
that the database is often large and the data on each object must be 
paged in and cast out as various objects are hit by the scanning ray. 
When the ray scans across a line, it often hits many objects that were 
hit in the previous scan, requiring them to be paged in again. Paging 
operations would be reduced if the scanning had some kind of locality 
property. For example, it might be helpful to scan a quadrant of the 
screen completely before going on to another quadrant. 


The Hilbert curve seems to have the locality property we are seeking. 
It scans a quadrant completely before scanning another, recursively, 
and also does not make a long jump when going from one quadrant to 
another. 


Douglas Voorhies [Voor] has simulated what the paging behavior 
would likely be for the conventional uni-directional scan line 
traversal, the Peano curve, and the Hilbert curve. His method is to 
scatter circles of a given size randomly on the screen. A scan path 
hitting a circle represents touching a new object, and paging it in. But 
when a scan leaves a circle, it is presumed that the object's data 
remains in memory until the scan exits a circle of radius twice that of 
the "object" circle. Thus, if the scan leaves the object for just a short 
distance and then returns to it, it is assumed that no paging operation 
occurred. He repeats this experiment for many different sizes of 
circles, on a simulated 1024 x 1024 screen. 


Assume that entering an object circle and leaving its surrounding 
circle represent one paging operation. Then, clearly the normal scan 
line causes D paging operations in covering a (not too big) circle of 
diameter D pixels, because each scan line that enters it leaves its outer 
circle. The interesting result of Voorhies's simulation is that for the 
Peano curve, the number of paging operations to scan a circle is about 
2.7 and, perhaps surprisingly, is independent of the circle's diameter. 


For the Hilbert curve the figure is about 1.4, also independent of the 
circle's diameter. Thus, the experiment suggests that the Hilbert curve 
is superior to the Peano curve, and vastly superior to the normal scan 
line path, in reducing paging operations. (The result that the page 
count is independent of the circles' diameters is probably an artifact of 
the outer circle's being proportional in size to the object circle.) 


Chapter 15. Floating-Point 


God created the integers, all else is the work of man. 
—Leopold Kronecker 


Operating on floating-point numbers with integer arithmetic and 
logical instructions is often a messy proposition. This is particularly 
true for the rules and formats of the IEEE Standard for Binary 
Floating-Point Arithmetic, IEEE Std. 754-1985, commonly known as 
"IEEE arithmetic." It has the NaN (not a number) and infinities, which 
are special cases for almost all operations. It has plus and minus zero, 
which must compare equal to one another. It has a fourth comparison 
result, "unordered." The most significant bit of the fraction is not 
explicitly present in "normal" numbers, but it is in "denormalized" or 
"subnormal" numbers. The fraction is in signed-true form and the 
exponent is in biased form, whereas integers are now almost 
universally in two's-complement form. There are of course reasons for 
all this, but it results in programs that are full of compares and 
branches, and that present a challenge to implement efficiently. 


We assume the reader has some familiarity with the IEEE standard, 
and thus summarize it here only very briefly. 


15-1 IEEE Format 


We will restrict our attention to the single and double formats (32- 
and 64-bit) described in IEEE 754. The standard also describes "single 
extended" and "double extended" formats, but they are only loosely 
described because the details are implementation-dependent (e.g., the 
exponent width is unspecified in the standard). The single and double 
formats are shown below. 


Single format                 Double format
| s | e | f  |                | s | e  | f  |
  1   8   23   (bits)           1   11   52   (bits)

The sign bit s is encoded as 0 for plus, 1 for minus. The biased 
exponent e and fraction f are magnitudes with their most significant 
bits on the left. The floating-point value represented is encoded as
shown below.


Single format                              Double format
e          f     value                     e           f     value
0          0     ±0                        0           0     ±0
0          ≠0    ±2^(-126) (0.f)           0           ≠0    ±2^(-1022) (0.f)
1 to 254   any   ±2^(e-127) (1.f)          1 to 2046   any   ±2^(e-1023) (1.f)
255        0     ±∞                        2047        0     ±∞
255        ≠0    NaN                       2047        ≠0    NaN


As an example, consider encoding the number π in single format. In
binary [Knu1],


π ≈ 11.0010 0100 0011 1111 0110 1010 1000 1000 1000 0101 1010 0011 0000 10....


This is in the range of the "normalized" numbers shown in the third
row of the table above. The most significant 1 in π is dropped, as the
leading 1 is not stored in the encoding of normalized numbers. The
exponent e - 127 should be 1, to get the binary point in the right
place, and hence e = 128. Thus, the representation is


0 10000000 10010010000111111011011


or, in hexadecimal, 


40490FDB, 


where we have rounded the fraction to the nearest representable 
number. 
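
The encoding of π can be checked on a machine with a short sketch like the one below. It assumes that the C type float is the IEEE single format, which holds on most current platforms but is not guaranteed by the C language; the type and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a single-format number. */
typedef struct {
    uint32_t sign;      /* 1 bit                          */
    uint32_t exponent;  /* 8 bits, biased by 127          */
    uint32_t fraction;  /* 23 bits, leading 1 not stored  */
} float_fields;

/* Returns the 32-bit pattern of f (assumes 32-bit IEEE float). */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* Reinterpret the bits portably. */
    return u;
}

/* Splits f into its sign, biased exponent, and fraction fields. */
float_fields float_decode(float f) {
    uint32_t u = float_bits(f);
    float_fields r;
    r.sign = u >> 31;
    r.exponent = (u >> 23) & 0xFF;
    r.fraction = u & 0x7FFFFF;
    return r;
}
```

Applied to 3.14159265358979f, float_bits gives the pattern 40490FDB shown above, and float_decode gives a biased exponent of 128.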




Numbers with $1 \le e \le 254$ are called "normalized numbers." These
are in "normal" form, meaning that their most significant bit is not
explicitly stored. Nonzero numbers with e = 0 are called
"denormalized numbers," or simply "denorms." Their most significant
bit is explicitly stored. This scheme is sometimes called "gradual
underflow." Some extreme values in the various ranges of floating-point
number are shown in Table 15-1. In this table "Max integer"
means the largest integer such that all integers less than or equal to it,
in absolute value, are representable exactly; the next integer is
rounded.


For normalized numbers, one unit in the last position (ulp) has a
relative value ranging from $1/2^{24}$ to $1/2^{23}$ (about $5.96 \times 10^{-8}$ to
$1.19 \times 10^{-7}$) for single format, and from $1/2^{53}$ to $1/2^{52}$ (about
$1.11 \times 10^{-16}$ to $2.22 \times 10^{-16}$) for double format. The maximum
"relative error," for round to nearest mode, is half of those figures.


The range of integers that is represented exactly is from $-2^{24}$ to $+2^{24}$
(-16,777,216 to +16,777,216) for single format, and from $-2^{53}$ to
$+2^{53}$ (-9,007,199,254,740,992 to +9,007,199,254,740,992) for
double format. Of course, certain integers outside these ranges, such 
as larger powers of 2, can be represented exactly; the ranges cited are 
the maximal ranges for which all integers are represented exactly. 
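
The boundaries of these ranges are easy to observe by converting an integer to floating-point and back. The following sketch assumes that float and double are the IEEE single and double formats and that conversions round to nearest; the function names are hypothetical.

```c
/* Returns 1 if the integer k survives a round trip through float,
   that is, if k is exactly representable in single format.
   Intended for |k| well within the range of long long. */
int float_holds_exactly(long long k) {
    volatile float f = (float)k;    /* volatile: force rounding to float. */
    return (long long)f == k;
}

/* The same test for double format. */
int double_holds_exactly(long long k) {
    volatile double d = (double)k;
    return (long long)d == k;
}
```

With these, 16,777,216 is held exactly in single format but 16,777,217 is not, while 16,777,218 (an even number) is again exact, illustrating that some integers outside the maximal range are still representable.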


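Table 15-1. Extreme Values

Single Precision
                       Hex          Exact Value                  Approximate Value
Smallest denorm        0000 0001    $2^{-149}$                   $1.401 \times 10^{-45}$
Largest denorm         007F FFFF    $2^{-126}(1 - 2^{-23})$      $1.175 \times 10^{-38}$
Smallest normalized    0080 0000    $2^{-126}$                   $1.175 \times 10^{-38}$
1.0                    3F80 0000    1                            1
Max integer            4B80 0000    $2^{24}$                     $1.678 \times 10^{7}$
Largest normalized     7F7F FFFF    $2^{128}(1 - 2^{-24})$       $3.403 \times 10^{38}$
∞                      7F80 0000    ∞                            ∞

Double Precision
                       Hex          Exact Value                  Approximate Value
Smallest denorm        0...0001     $2^{-1074}$                  $4.941 \times 10^{-324}$
Largest denorm         000F...F     $2^{-1022}(1 - 2^{-52})$     $2.225 \times 10^{-308}$
Smallest normalized    0010...0     $2^{-1022}$                  $2.225 \times 10^{-308}$
1.0                    3FF0...0     1                            1
Max integer            4340...0     $2^{53}$                     $9.007 \times 10^{15}$
Largest normalized     7FEF...F     $2^{1024}(1 - 2^{-53})$      $1.798 \times 10^{308}$
∞                      7FF0...0     ∞                            ∞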


One might want to change division by a constant to multiplication by 
the reciprocal. This can be done with complete (IEEE) accuracy only 
for numbers whose reciprocals are represented exactly. These are the 
powers of 2 from $2^{-127}$ to $2^{127}$ for single format, and from $2^{-1023}$ to
$2^{1023}$ for double format. The numbers $2^{-127}$ and $2^{-1023}$ are
denormalized numbers, which are best avoided on machines that 
implement operations on denormalized numbers inefficiently. 
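
The difference between dividing and multiplying by a rounded reciprocal can be seen with a small experiment. The sketch below counts, over the integers 1 to limit, how often the two results differ in single format; recip_mismatches is a hypothetical name, and the code assumes IEEE single-format float arithmetic with round to nearest.

```c
/* Counts the integers x in 1..limit for which x / d and x * (1/d)
   give different single-format results. */
int recip_mismatches(float d, int limit) {
    volatile float r = 1.0f / d;            /* Rounded reciprocal of d. */
    int count = 0;
    for (int x = 1; x <= limit; x++) {
        volatile float q = (float)x / d;    /* Correctly rounded quotient. */
        volatile float p = (float)x * r;    /* Product with reciprocal.    */
        if (q != p) count++;
    }
    return count;
}
```

For d = 8.0f, a power of 2, the count is zero. For d = 3.0f it is not: for example, 5.0f/3.0f and 5.0f*(1.0f/3.0f) round to adjacent single-format values.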


15-2 Comparing Floating-Point Numbers Using Integer 
Operations 


One of the features of the IEEE encodings is that non-NaN values are
properly ordered if treated as signed-magnitude integers.


To program a floating-point comparison using integer operations, it is 
necessary that the "unordered" result not be needed. In IEEE 754, the 
unordered result occurs when one or both comparands are NaNs. The 
methods below treat NaNs as if they were numbers greater in 
magnitude than infinity. 


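The comparisons are also much simpler if -0.0 may be treated as
strictly less than +0.0 (which is not in accordance with IEEE 754).
Assuming this is acceptable, the comparisons may be done as shown
below, where $\stackrel{f}{=}$, $\stackrel{f}{<}$, and $\stackrel{f}{\le}$ denote floating-point
comparisons, $\stackrel{u}{>}$ and $\stackrel{u}{\ge}$ denote unsigned integer comparisons, and
the $\approx$ symbol is used as a reminder that these formulas do not treat
±0.0 quite right.


$$a \stackrel{f}{=} b \approx (a = b)$$
$$a \stackrel{f}{<} b \approx (a \ge 0 \;\&\; a < b) \mid (a < 0 \;\&\; a \stackrel{u}{>} b)$$
$$a \stackrel{f}{\le} b \approx (a \ge 0 \;\&\; a \le b) \mid (a < 0 \;\&\; a \stackrel{u}{\ge} b)$$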
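
As an illustration, the "less than" and "less than or equal" comparisons (signed integer comparison when a is nonnegative, unsigned comparison when a is negative) might be coded as follows. This is a sketch under the same assumptions as the text: float is the 32-bit IEEE single format, NaNs do not occur, and -0.0 is treated as strictly less than +0.0. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* a < b using only integer comparisons on the bit patterns. */
int float_lt_int(float fa, float fb) {
    int32_t a, b;
    uint32_t ua, ub;
    memcpy(&a, &fa, sizeof a);   memcpy(&b, &fb, sizeof b);
    memcpy(&ua, &fa, sizeof ua); memcpy(&ub, &fb, sizeof ub);
    return (a >= 0 && a < b)     /* a nonnegative: signed compare.     */
        || (a < 0 && ua > ub);   /* a negative: unsigned, sense reversed. */
}

/* a <= b, with the same conventions. */
int float_le_int(float fa, float fb) {
    int32_t a, b;
    uint32_t ua, ub;
    memcpy(&a, &fa, sizeof a);   memcpy(&b, &fb, sizeof b);
    memcpy(&ua, &fa, sizeof ua); memcpy(&ub, &fb, sizeof ub);
    return (a >= 0 && a <= b) || (a < 0 && ua >= ub);
}
```

Note that float_lt_int(-0.0f, 0.0f) returns 1, which is the deviation from IEEE 754 that the text accepts here.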


If -0.0 must be treated as equal to +0.0, there does not seem to be 
any very slick way to do it, but the following formulas, which follow 
more or less obviously from the above, are possibilities. 


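$$a \stackrel{f}{=} b \equiv (a = b) \mid (-a = a \;\&\; -b = b)$$
$$\equiv (a = b) \mid ((a \mid b) = \text{0x80000000})$$
$$\equiv (a = b) \mid (((a \mid b) \;\&\; \text{0x7FFFFFFF}) = 0)$$
$$a \stackrel{f}{<} b \equiv ((a \ge 0 \;\&\; a < b) \mid (a < 0 \;\&\; a \stackrel{u}{>} b)) \;\&\; ((a \mid b) \ne \text{0x80000000})$$
$$a \stackrel{f}{\le} b \equiv (a \ge 0 \;\&\; a \le b) \mid (a < 0 \;\&\; a \stackrel{u}{\ge} b) \mid ((a \mid b) = \text{0x80000000})$$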


In some applications, it might be more efficient to first transform the
numbers in some way, and then do a floating-point comparison with a
single fixed-point comparison instruction. For example, in sorting n
numbers, the transformation would be done only once to each
number, whereas a comparison must be done at least $\lceil n \log_2 n \rceil$
times (in the minimax sense).


Table 15-2 gives four such transformations. For those in the left
column, -0.0 compares equal to +0.0, and for those in the right
column, -0.0 compares less than +0.0. In all cases, the sense of the
comparison is not altered by the transformation. Variable n is signed, 
t is unsigned, and c may be either signed or unsigned. 


The last row shows branch-free code that can be implemented on our 
basic RISC in four instructions for the left column, and three for the 
right column (these four or three instructions must be executed for 
each comparand). 
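
One such transformation (add 0x80000000 to nonnegative values, complement negative ones, and then compare as unsigned integers, so that -0.0 compares less than +0.0) can be written portably with unsigned arithmetic, as in the sketch below. The function name is hypothetical, float is assumed to be the 32-bit IEEE single format, and NaNs are assumed not to occur.

```c
#include <stdint.h>
#include <string.h>

/* Maps a non-NaN float to a key such that comparing keys as
   unsigned integers orders the original floats, with -0.0
   ordered just below +0.0. */
uint32_t float_sort_key(float f) {
    uint32_t n;
    memcpy(&n, &f, sizeof n);
    if (n & 0x80000000u)
        return ~n;                  /* Negative: reverse the order.      */
    else
        return n + 0x80000000u;     /* Nonnegative: move above negatives. */
}
```

In a sort, each element would be converted to its key once, after which every comparison is a single unsigned integer compare.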


15-3 The Distribution of Leading Digits 


When IBM introduced the System/360 computer in 1964, numerical 
analysts were horrified at the loss of precision of single-precision 
arithmetic. The previous IBM computer line, the 704 - 709 - 7090 
family, had a 36-bit word. For single-precision floating-point, the 
format consisted of a 9-bit sign and exponent field, followed by a 27- 
bit fraction in binary. The most significant fraction bit was explicitly 
included (in "normal" numbers), so quantities were represented with a 
precision of 27 bits. 


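Table 15-2. Preconditioning Floating-Point Numbers for Integer
Comparisons

-0.0 = +0.0 (IEEE)                      -0.0 < +0.0 (non-IEEE)

if (n >= 0) n = n+0x80000000;           if (n >= 0) n = n+0x80000000;
else n = -n;                            else n = ~n;
Use unsigned comparison.                Use unsigned comparison.

c = 0x7FFFFFFF;                         c = 0x7FFFFFFF;
if (n < 0) n = (n ^ c) + 1;             if (n < 0) n = n ^ c;
Use signed comparison.                  Use signed comparison.

c = 0x80000000;                         c = 0x7FFFFFFF;
if (n < 0) n = c - n;                   if (n < 0) n = c - n;
Use signed comparison.                  Use signed comparison.

t = n >> 31;                            t = (unsigned) (n>>30) >> 1;
n = (n ^ (t >> 1)) - t;                 n = n ^ t;
Use signed comparison.                  Use signed comparison.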


The S/360 has a 32-bit word. For single-precision, IBM chose to have 
an 8-bit sign and exponent field followed by a 24-bit fraction. This 
drop from 27 to 24 bits was bad enough, but it gets worse. To keep 
the exponent range large, a unit in the 7-bit exponent of the S/360 
format represents a factor of 16. Thus, the fraction is in base 16, and 
this format came to be called "hexadecimal" floating-point. The 
leading digit can be any number from 1 to 15 (binary 0001 to 1111). 
Numbers with leading digit 1 have only 21 bits of precision (because 
of the three leading 0's), but they should constitute only 1/15 (6.7%) 
of all numbers. 


No, it's worse than that! There was a flurry of activity to show, both 
analytically and empirically, that leading digits are not uniformly 
distributed. In hexadecimal floating-point, one would expect 25% of 
the numbers to have leading digit 1, and hence only 21 bits of 
precision. 


Let us consider the distribution of leading digits in decimal. Suppose 
you have a large set of numbers with units, such as length, volume, 
mass, speed, and so on, expressed in "scientific" notation (e.g., $6.022 \times
10^{23}$). If the leading digit of a large number of such numbers has a
well-defined distribution function, then it must be independent of the
units—whether inches or centimeters, pounds or kilograms, and so on. 
Thus, if you multiply all the numbers in the set by any constant, the 
distribution of leading digits should be unchanged. For example, 
considering multiplying by 2, we conclude that the number of 
numbers with leading digit 1 (those from 1.0 to 1.999... times 10 to 
some power) must equal the number of numbers with leading digit 2 
or 3 (those from 2.0 to 3.999... times 10 to some power), because it 
shouldn't matter if our unit of length is inches or half inches, or our 
unit of mass is kilograms or half kilograms, and so on. 


Let f(x), for $1 \le x < 10$, be the probability density function for the
leading digits of the set of numbers with units. f(x) has the property
that

$$\int_a^b f(x)\,dx$$


is the proportion of numbers that have leading digits ranging from a 
to b. Referring to the figure below, for a small increment $\Delta x$ in x, f
must satisfy


$$f(1) \cdot \Delta x = f(x) \cdot x \Delta x,$$


because $f(1) \cdot \Delta x$ is, approximately, the proportion of numbers ranging
from 1 to $1 + \Delta x$ (ignoring a multiplier of a power of 10), and
$f(x) \cdot x \Delta x$ is the approximate proportion of numbers ranging from x to
$x + x \Delta x$. Because the latter set is the first set multiplied by x, their
proportions must be equal. Thus, the probability density function is a
simple inverse relationship,


$$f(x) = f(1)/x.$$


Because the area under the curve from x = 1 to x = 10 must be 1 (all 
numbers have leading digits from 1.000... to 9.999...), it is easily 
shown that 


$$f(1) = 1/\ln 10.$$


The proportion of numbers with leading digits in the range a to b,
with $1 \le a \le b < 10$, is


$$\int_a^b \frac{dx}{x \ln 10} = \left. \frac{\ln x}{\ln 10} \right|_a^b = \frac{\ln (b/a)}{\ln 10} = \log_{10} \frac{b}{a}.$$


Thus, in decimal, the proportion of numbers with leading digit 1 is
$\log_{10}(2/1) \approx 0.30103$, and the proportion of numbers with leading
digit 9 is $\log_{10}(10/9) \approx 0.0458$.


For base 16, the proportion of numbers with leading digits in the
range a to b, with $1 \le a \le b < 16$, is similarly derived to be
$\log_{16}(b/a)$. Hence the proportion of numbers with leading digit 1 is
$\log_{16}(2/1) = 1/\log_2 16 = 0.25$.
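
The predicted proportion for leading decimal digit 1, about 0.30103, can be compared with a simple scale-invariant data set. The sketch below uses the powers of 2 as that data set, which is a choice made for this illustration and not an example from the text; the function name is hypothetical.

```c
/* Counts how many of the powers 2^0, 2^1, ..., 2^(count-1) have
   leading decimal digit 1. The leading digit of 2^k is 1 exactly
   when the fractional part of k*log10(2) is less than log10(2). */
int leading_ones_in_powers_of_2(int count) {
    const double log10_2 = 0.30102999566398120;
    int ones = 0;
    for (int k = 0; k < count; k++) {
        double t = k * log10_2;
        double frac = t - (double)(long)t;   /* t >= 0, so the cast truncates. */
        if (frac < log10_2) ones++;
    }
    return ones;
}
```

Among the first 1000 powers of 2, 301 begin with the digit 1, in close agreement with $\log_{10} 2$.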


15-4 Table of Miscellaneous Values 


Table 15-3 shows the IEEE representation of miscellaneous values that 
may be of interest. The values that are not exact are rounded to the 
nearest representable value. 


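Table 15-3. Miscellaneous Values

Decimal Value                 Single Format (Hex)   Double Format (Hex)
-∞                            FF80 0000             FFF0 0000 0000 0000
-2.0                          C000 0000             C000 0000 0000 0000
-1.0                          BF80 0000             BFF0 0000 0000 0000
-0.5                          BF00 0000             BFE0 0000 0000 0000
-0.0                          8000 0000             8000 0000 0000 0000
+0.0                          0000 0000             0000 0000 0000 0000
Smallest positive denorm      0000 0001             0000 0000 0000 0001
Largest denorm                007F FFFF             000F FFFF FFFF FFFF
Least positive normalized     0080 0000             0010 0000 0000 0000
π/180 (0.01745...)            3C8E FA35             3F91 DF46 A252 9D39
0.1                           3DCC CCCD             3FB9 9999 9999 999A
log10 2 (0.3010...)           3E9A 209B             3FD3 4413 509F 79FF
1/e (0.3678...)               3EBC 5AB2             3FD7 8B56 362C EF38
1/ln 10 (0.4342...)           3EDE 5BD9             3FDB CB7B 1526 E50E
0.5                           3F00 0000             3FE0 0000 0000 0000
ln 2 (0.6931...)              3F31 7218             3FE6 2E42 FEFA 39EF
1/√2 (0.7071...)              3F35 04F3             3FE6 A09E 667F 3BCD
1/ln 3 (0.9102...)            3F69 0570             3FED 20AE 03BC C153
1.0                           3F80 0000             3FF0 0000 0000 0000
ln 3 (1.098...)               3F8C 9F54             3FF1 93EA 7AAD 030B
√2 (1.414...)                 3FB5 04F3             3FF6 A09E 667F 3BCD
1/ln 2 (1.442...)             3FB8 AA3B             3FF7 1547 652B 82FE
√3 (1.732...)                 3FDD B3D7             3FFB B67A E858 4CAA
2.0                           4000 0000             4000 0000 0000 0000
ln 10 (2.302...)              4013 5D8E             4002 6BB1 BBB5 5516
e (2.718...)                  402D F854             4005 BF0A 8B14 5769
3.0                           4040 0000             4008 0000 0000 0000
π (3.141...)                  4049 0FDB             4009 21FB 5444 2D18
√10 (3.162...)                404A 62C2             4009 4C58 3ADA 5B53
log2 10 (3.321...)            4054 9A78             400A 934F 0979 A371
4.0                           4080 0000             4010 0000 0000 0000
5.0                           40A0 0000             4014 0000 0000 0000
6.0                           40C0 0000             4018 0000 0000 0000
2π (6.283...)                 40C9 0FDB             4019 21FB 5444 2D18
7.0                           40E0 0000             401C 0000 0000 0000
8.0                           4100 0000             4020 0000 0000 0000
9.0                           4110 0000             4022 0000 0000 0000
10.0                          4120 0000             4024 0000 0000 0000
11.0                          4130 0000             4026 0000 0000 0000
12.0                          4140 0000             4028 0000 0000 0000
13.0                          4150 0000             402A 0000 0000 0000
14.0                          4160 0000             402C 0000 0000 0000
15.0                          4170 0000             402E 0000 0000 0000
16.0                          4180 0000             4030 0000 0000 0000
180/π (57.295...)             4265 2EE1             404C A5DC 1A63 C1F8
2^23 - 1                      4AFF FFFE             415F FFFF C000 0000
2^23                          4B00 0000             4160 0000 0000 0000
2^24 - 1                      4B7F FFFF             416F FFFF E000 0000
2^24                          4B80 0000             4170 0000 0000 0000
2^31 - 1                      4F00 0000             41DF FFFF FFC0 0000
2^31                          4F00 0000             41E0 0000 0000 0000
2^32 - 1                      4F80 0000             41EF FFFF FFE0 0000
2^32                          4F80 0000             41F0 0000 0000 0000
2^52                          5980 0000             4330 0000 0000 0000
2^63                          5F00 0000             43E0 0000 0000 0000
2^64                          5F80 0000             43F0 0000 0000 0000
Largest normalized            7F7F FFFF             7FEF FFFF FFFF FFFF
∞                             7F80 0000             7FF0 0000 0000 0000
"Smallest" SNaN               7F80 0001             7FF0 0000 0000 0001
"Largest" SNaN                7FBF FFFF             7FF7 FFFF FFFF FFFF
"Smallest" QNaN               7FC0 0000             7FF8 0000 0000 0000
"Largest" QNaN                7FFF FFFF             7FFF FFFF FFFF FFFF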


IEEE 754 does not specify how the signaling and quiet NaNs are 
distinguished. Table 15-3 uses the convention employed by PowerPC, 
the AMD 29050, the Intel x86 and i860, and the Fairchild Clipper: The
most significant fraction bit is 0 for signaling and 1 for quiet NaN's. 
The Compaq Alpha, HP PA-RISC, and MIPS computers use the same 
bit to make the distinction, but in the opposite sense (0 = quiet, 1 = 
signaling). 


Chapter 16. Formulas for Primes 


Introduction 
Willans's Formulas 
Wormell's Formula 


Formulas for Other Difficult Functions 


16-1 Introduction 


Like many young students, I once became fascinated with prime 
numbers, and tried to find a formula for them. I didn't know exactly 
what operations would be considered valid in a "formula," or exactly 
what function I was looking for—a formula for the nth prime in terms 
of n, or in terms of the previous prime(s), or a formula that produces 
primes but not all of them, and so on. Nevertheless, in spite of these 
ambiguities, I would like to discuss a little of what is known about this 
problem. We will see that (a) there are formulas for primes, and (b) 
none of them are very satisfying. 


Much of this subject relates to the present work in that it deals with 
formulas similar to those of some of our programming tricks, albeit in 
the domain of real number arithmetic rather than "computer 
arithmetic." But let us first review a few highlights from the history of 
this subject. 


In 1640, Fermat conjectured that the formula 

$$F_n = 2^{2^n} + 1$$

always produces a prime, and numbers of this form have come to be 
called "Fermat numbers." It is true that $F_n$ is prime for n ranging from 
0 to 4, but Euler found in 1732 that 

$$F_5 = 2^{2^5} + 1 = 641 \cdot 6700417.$$


(We have seen these factors before in connection with dividing by a 
constant on a 32-bit machine.) Then, F. Landry showed in 1880 that 

$$F_6 = 2^{2^6} + 1 = 274177 \cdot 67280421310721.$$

$F_n$ is now known to be composite for many larger values of n, such as 
all n from 7 to 16 inclusive. For no value of n > 4 is it known to be 
prime [H&W]. So much for rash conjectures. [1] 


[1] However, this is the only conjecture of Fermat known to be wrong 
[Wells]. 
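
Euler's factorization is easy to confirm by machine. The following is a minimal sketch, not taken from the text: the helper names `fermat` and `smallest_factor` are hypothetical, and the code assumes 64-bit arithmetic, so it handles only n <= 5 ($F_6 = 2^{64} + 1$ no longer fits). 

```c
#include <assert.h>
#include <stdint.h>

/* F_n = 2^(2^n) + 1; representable in 64 bits only for n <= 5. */
uint64_t fermat(unsigned n) {
    assert(n <= 5);
    return (UINT64_C(1) << (1u << n)) + 1;
}

/* Smallest divisor d >= 2 of x, found by trial division.
   Returns x itself when x is prime. */
uint64_t smallest_factor(uint64_t x) {
    assert(x >= 2);
    for (uint64_t d = 2; d * d <= x; d++)
        if (x % d == 0)
            return d;
    return x;
}
```

Calling `smallest_factor(fermat(5))` finds 641 after a few hundred trial divisions, whereas for `fermat(4)` = 65537 the search runs to completion and reports the number itself, confirming that it is prime. 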


Incidentally, why would Fermat be led to the double exponential? He 
knew that if m has an odd factor other than 1, then $2^m + 1$ is 
composite. For if m = ab with b odd and not equal to 1, then 

$$2^{ab} + 1 = (2^a + 1)(2^{a(b-1)} - 2^{a(b-2)} + 2^{a(b-3)} - \cdots + 1).$$

Knowing this, he must have wondered about $2^m + 1$ with m not 
containing any odd factors (other than 1); that is, $m = 2^n$. He tried a 
few values of n and found that $2^{2^n} + 1$ seemed to be prime. 


Certainly everyone would agree that a polynomial qualifies as a 
"formula." One rather amazing polynomial was discovered by 
Leonhard Euler in 1772. He found that 


$$f(n) = n^2 + n + 41$$

is prime-valued for every n from 0 to 39. His result can be extended. 
Because 

$$f(-n) = n^2 - n + 41 = f(n - 1),$$

f(-n) is prime-valued for every n from 1 to 40; that is, f(n) is 
prime-valued for every n from -1 to -40. Therefore, 

$$f(n - 40) = (n - 40)^2 + (n - 40) + 41 = n^2 - 79n + 1601$$

is prime-valued for every n from 0 to 79. (However, it is lacking in 
aesthetic appeal because it is nonmonotonic and it repeats; that is, for 
n = 0, 1, ..., 79, $n^2 - 79n + 1601$ = 1601, 1523, 1447, ..., 43, 41, 
41, 43, ..., 1447, 1523, 1601.) 
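
Both claims about Euler's polynomial can be checked by brute force. The sketch below is an illustration only; `is_prime` and `prime_run` are hypothetical helper names, and trial division is assumed to be adequate because the values involved are tiny. 

```c
#include <assert.h>

/* Trial-division primality test; adequate for the small values used here. */
int is_prime(long x) {
    if (x < 2)
        return 0;
    for (long d = 2; d * d <= x; d++)
        if (x % d == 0)
            return 0;
    return 1;
}

/* Count how many consecutive n = 0, 1, 2, ... make n^2 + c1*n + c0 prime,
   giving up after `limit` values of n. */
int prime_run(long c1, long c0, int limit) {
    assert(limit >= 0);
    int n = 0;
    while (n < limit && is_prime((long)n * n + c1 * n + c0))
        n++;
    return n;
}
```

Here `prime_run(1, 41, 1000)` stops at n = 40, where the polynomial takes the value 1681 = 41 * 41, and `prime_run(-79, 1601, 1000)` stops at n = 80 for the same reason. 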


In spite of this success, it is now known that there is no polynomial 
f(n) that produces a prime for every n (aside from constant 
polynomials such as f(n) = 5). In fact, any nontrivial "polynomial in 


exponentials" is composite infinitely often. More precisely, as stated in 
[H&W], 


Theorem. If $f(n) = P(n, 2^n, 3^n, \ldots, k^n)$ is a polynomial in its 
arguments, with integral coefficients, and $f(n) \to \infty$ when 
$n \to \infty$, then f(n) is composite for an infinity of values of n. 

Thus, a formula such as $n^2 \cdot 2^n + 2n^3 + 2n + 5$ must produce an 
infinite number of composites. On the other hand, the theorem says 
nothing about formulas containing terms such as $2^{2^n}$, $n^n$, and n!. 


A formula for the nth prime in terms of n can be obtained by using the 
floor function and a magic number 


a = 0.203005000700011000013... 

The number a is, in decimal, the first prime written in the first place 
after the decimal point, the second prime written in the next two 
places, the third prime written in the next three places, and so on. 
There is always room for the nth prime, because $p_n < 10^n$. We will 
not prove this, except to point out that it is known that there is always 
a prime between n and 2n (for n ≥ 2), and hence certainly at least 
one between n and 10n, from which it follows that $p_n < 10^n$. The 
formula for the nth prime is 

$$p_n = \left\lfloor 10^{(n^2+n)/2} a \right\rfloor - 10^n \left\lfloor 10^{(n^2-n)/2} a \right\rfloor,$$

where we have used the relation 1 + 2 + 3 + ... + n = $(n^2 + n)/2$. 
For example, 

$$p_3 = \lfloor 10^6 a \rfloor - 10^3 \lfloor 10^3 a \rfloor = 203005 - 203000 = 5.$$


This is a pretty cheap trick, as it requires knowledge of the result to 


define a. The formula would be interesting if there were some way to 
define a independently of the primes, but no one knows of such a 
definition. 


Obviously, this technique can be used to obtain a formula for many 
sequences, but it begs the question. 
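
To see the digit-extraction formula at work, the sketch below stores the first 15 digits of a as a string and evaluates the two floor terms exactly in integer arithmetic. The names `A_DIGITS`, `floor_scaled`, `pow10u`, and `nth_prime_from_a` are hypothetical, and the digit string itself had to be written down from the primes 2, 3, 5, 7, 11, which is precisely why the formula begs the question. With 15 digits it covers only n = 1 to 5. 

```c
#include <assert.h>
#include <stdint.h>

/* First 15 digits of a after the decimal point: 2 | 03 | 005 | 0007 | 00011. */
static const char A_DIGITS[] = "203005000700011";

/* floor(10^k * a) for 0 <= k <= 15, read directly off the digit string. */
uint64_t floor_scaled(unsigned k) {
    assert(k <= 15);
    uint64_t v = 0;
    for (unsigned i = 0; i < k; i++)
        v = v * 10 + (uint64_t)(A_DIGITS[i] - '0');
    return v;
}

/* 10^e as an integer (e is small here). */
uint64_t pow10u(unsigned e) {
    uint64_t v = 1;
    while (e-- > 0)
        v *= 10;
    return v;
}

/* p_n = floor(10^((n^2+n)/2) a) - 10^n floor(10^((n^2-n)/2) a), for n = 1..5. */
uint64_t nth_prime_from_a(unsigned n) {
    assert(n >= 1 && n <= 5);
    return floor_scaled(n * (n + 1) / 2)
         - pow10u(n) * floor_scaled(n * (n - 1) / 2);
}
```

For n = 3 this computes 203005 - 1000 * 203, reproducing the worked example. 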


16-2 Willans's Formulas 
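C. P. Willans gives the following formula for the nth prime [Will]: 

$$p_n = 1 + \sum_{m=1}^{2^n} \left\lfloor \sqrt[n]{n} \left( \sum_{x=1}^{m} \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor \right)^{-1/n} \right\rfloor$$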


The derivation starts from Wilson's theorem, which states that p is 
prime or 1 if and only if $(p - 1)! \equiv -1 \pmod{p}$. Thus, 

$$\frac{(x-1)! + 1}{x}$$

is an integer for x prime or x = 1 and is fractional for all composite x. 
Hence 

Equation 1 

$$F(x) = \left\lfloor \cos^2 \pi \frac{(x-1)! + 1}{x} \right\rfloor = \begin{cases} 1, & x \text{ prime or } 1, \\ 0, & x \text{ composite.} \end{cases}$$


Thus, if $\pi(m)$ denotes [2] the number of primes ≤ m, 

[2] Our apologies for the two uses of $\pi$ in close proximity, but it's 
standard notation and shouldn't cause any difficulty. 

Equation 2 

$$\pi(m) = -1 + \sum_{x=1}^{m} F(x).$$


Observe that $\pi(p_n) = n$, and furthermore, 

$$\pi(m) < n, \text{ for } m < p_n, \text{ and}$$

$$\pi(m) \ge n, \text{ for } m \ge p_n.$$

Therefore, the number of values of m from 1 to ∞ for which $\pi(m) < n$ 
is $p_n - 1$. That is, 

Equation 3 

$$p_n = 1 + \sum_{m=1}^{\infty} (\pi(m) < n),$$


where the summand is a "predicate expression" (0/1-valued). 


Because we have a formula for $\pi(m)$, Equation (3) constitutes a 
formula for the nth prime as a function of n. But it has two features 
that might be considered unacceptable: an infinite summation and the 
use of a "predicate expression," which is not in standard mathematical 
usage. 

It has been proved that for n ≥ 1 there is at least one prime between 
n and 2n. Therefore, the number of primes ≤ $2^n$ is at least n; that is, 
$\pi(2^n) \ge n$. Thus, the predicate $\pi(m) < n$ is 0 for $m \ge 2^n$, so the 
upper limit of the summation above can be replaced with $2^n$. 
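
Equation (3), with its summation cut off at $2^n$, can be evaluated directly. The sketch below is one way to do so under stated assumptions: instead of forming (x - 1)! and taking a cosine, it tests Wilson's congruence with the factorial reduced modulo x at every step, which gives the same 0/1 value as Equation (1) without overflow or rounding. The function names are hypothetical, and n is limited to 8 only to keep the triple loop short. 

```c
#include <assert.h>

/* F(x) of Equation (1): 1 if x is prime or 1, else 0.
   Wilson's theorem: x divides (x-1)! + 1 exactly in those cases. */
int F(unsigned x) {
    assert(x >= 1);
    unsigned long long fact = 1 % x;      /* (x-1)! mod x, built up below */
    for (unsigned i = 2; i < x; i++)
        fact = fact * i % x;
    return (fact + 1) % x == 0;
}

/* Equation (2): pi(m) = -1 + sum of F(x) for x = 1..m. */
int prime_count(unsigned m) {
    int s = -1;
    for (unsigned x = 1; x <= m; x++)
        s += F(x);
    return s;
}

/* Equation (3) with upper limit 2^n: p_n = 1 + sum over m of (pi(m) < n). */
unsigned nth_prime(unsigned n) {
    assert(n >= 1 && n <= 8);
    unsigned p = 1;
    for (unsigned m = 1; m <= (1u << n); m++)
        p += (prime_count(m) < (int)n);   /* predicate expression, 0 or 1 */
    return p;
}
```

The C comparison `prime_count(m) < (int)n` plays the role of the "predicate expression": it evaluates to 1 or 0 and is simply added up. 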


Willans has a rather clever substitute for the predicate expression. Let 


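$$\mathrm{LT}(x, y) = \left\lfloor \sqrt[y]{\frac{y}{1 + x}} \right\rfloor, \quad \text{for } x = 0, 1, 2, \ldots;\ y = 1, 2, \ldots.$$

Then, if x < y, $1 \le y/(1 + x) \le y$, so 

$$1 \le \sqrt[y]{y/(1 + x)} \le \sqrt[y]{y} < 2.$$

Furthermore, if x ≥ y, then $0 < y/(1 + x) < 1$, so 

$$0 \le \sqrt[y]{y/(1 + x)} < 1.$$

Applying the floor function, we have 

$$\mathrm{LT}(x, y) = \begin{cases} 1, & \text{for } x < y, \\ 0, & \text{for } x \ge y. \end{cases}$$

That is, LT(x, y) is the predicate x < y (for x and y in the given 
ranges). 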


Substituting, Equation (3) can be written 


$$p_n = 1 + \sum_{m=1}^{2^n} \mathrm{LT}(\pi(m), n) = 1 + \sum_{m=1}^{2^n} \left\lfloor \sqrt[n]{\frac{n}{1 + \pi(m)}} \right\rfloor.$$

Further substituting Equation (2) for $\pi(m)$ in terms of F(x), and 
Equation (1) for F(x), gives the formula shown at the beginning of this 
section. 


Second Formula 


Willans then gives another formula: 


$$p_n = \sum_{m=1}^{2^n} m F(m) \left\lfloor 2^{-|\pi(m) - n|} \right\rfloor.$$

Here, F and $\pi$ are the functions used in his first formula. Thus, mF(m) 
= m if m is prime or 1, and 0 otherwise. The third factor in the 
summand is the predicate $\pi(m) = n$. The summand is 0 except for one 
term, which is the nth prime. For example, 

$p_4$ = 1·1·0 + 2·1·0 + 3·1·0 + 4·0·0 + 5·1·0 + 6·0·0 + 7·1·1 
      + 8·0·1 + 9·0·1 + 10·0·1 + 11·1·0 + 12·0·0 + ... + 16·0·0 
      = 7. 
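
The second formula translates almost term for term into integer code, because $\lfloor 2^{-k} \rfloor$ for k ≥ 0 is just 1 shifted right by k places. The sketch below rests on two assumptions that are not part of Willans's formula: F is evaluated by a modular form of Wilson's theorem rather than by a cosine, and $\pi(m)$ is kept as a running total instead of being re-summed for every m. All function names are hypothetical. 

```c
#include <assert.h>

/* F(x): 1 if x is prime or 1, else 0 (Wilson's theorem, factorial mod x). */
int F(unsigned x) {
    assert(x >= 1);
    unsigned long long fact = 1 % x;
    for (unsigned i = 2; i < x; i++)
        fact = fact * i % x;
    return (fact + 1) % x == 0;
}

/* floor(2^(-k)) for k >= 0: 1 when k == 0, otherwise 0. */
unsigned floor_pow2_neg(unsigned k) {
    return k < 32 ? 1u >> k : 0;
}

/* p_n = sum over m = 1..2^n of m * F(m) * floor(2^(-|pi(m) - n|)). */
unsigned nth_prime_second(unsigned n) {
    assert(n >= 1 && n <= 16);
    int pi_m = -1;                        /* pi(m) = -1 + F(1) + ... + F(m) */
    unsigned sum = 0;
    for (unsigned m = 1; m <= (1u << n); m++) {
        int f = F(m);
        pi_m += f;
        unsigned dist = (unsigned)(pi_m > (int)n ? pi_m - (int)n
                                                 : (int)n - pi_m);
        sum += m * (unsigned)f * floor_pow2_neg(dist);
    }
    return sum;
}
```

For n = 4 the only nonzero summand is m = 7, matching the worked example. 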


Third Formula 


Willans goes on to present another formula for the nth prime that does 
not use any "nonanalytic" [3] functions such as floor and absolute 
value. He starts by noting that for x = 2, 3, ..., the function 

[3] This is my terminology, not Willans's. 

$$\frac{((x-1)!)^2}{x} = \begin{cases} \text{an integer} + \frac{1}{x}, & \text{when } x \text{ is prime,} \\ \text{an integer,} & \text{when } x \text{ is composite or } 1. \end{cases}$$

The first part follows from 

$$\frac{((x-1)!)^2}{x} = \frac{((x-1)! + 1)((x-1)! - 1)}{x} + \frac{1}{x}$$

and x divides (x - 1)! + 1, by Wilson's theorem. Thus, the predicate "x 
is prime," for x ≥ 2, is given by 

$$H(x) = \frac{\sin^2 \pi \frac{((x-1)!)^2}{x}}{\sin^2 \frac{\pi}{x}}.$$

From this it follows that 

$$\pi(m) = \sum_{x=2}^{m} H(x), \quad \text{for } m = 2, 3, \ldots.$$


This cannot be converted to a formula for $p_n$ by the methods used in 
the first two formulas, because they use the floor function. Instead, 
Willans suggests the following formula [4] for the predicate x < y, for 
x, y ≥ 1: 

[4] We have slightly simplified his formula. 

$$\mathrm{LT}(x, y) = \sin\left(\frac{\pi}{2} \cdot 2^e\right), \quad \text{where } e = \prod_{i=0}^{y-1} (x - i).$$

Thus, if x < y, e = x(x - 1)...(0)(-1)...(x - (y - 1)) = 0, so that 
LT(x, y) = sin(π/2) = 1. If x ≥ y, the product does not include 0, so 
e ≥ 1, so that LT(x, y) = sin((π/2) · (an even number)) = 0. 

Finally, as in the first of Willans's formulas, 

$$p_n = 2 + \sum_{m=2}^{2^n} \mathrm{LT}(\pi(m), n).$$

Written out in full, this is the rather formidable 

$$p_n = 2 + \sum_{m=2}^{2^n} \sin\left(\frac{\pi}{2} \cdot 2^{\prod_{i=0}^{n-1} \left( \sum_{x=2}^{m} \frac{\sin^2 \pi \frac{((x-1)!)^2}{x}}{\sin^2 \frac{\pi}{x}} - i \right)}\right).$$

Fourth Formula 
Willans then gives a formula for $p_{n+1}$ in terms of $p_n$: 

$$p_{n+1} = 1 + p_n + \sum_{i=1}^{p_n} \prod_{j=1}^{i} f(p_n + j),$$

where f(x) is the predicate "x is composite," for x ≥ 2; that is, 

$$f(x) = \left\lfloor \cos^2 \pi \frac{((x-1)!)^2}{x} \right\rfloor.$$


Alternatively, one could use f(x) = 1 - H(x), to keep the formula free 
of floor functions. 


As an example of this formula, let $p_n$ = 7. Then, 

$p_{n+1}$ = 1 + 7 + f(8) + f(8)f(9) + f(8)f(9)f(10) 
          + f(8)f(9)f(10)f(11) + ... + f(8)f(9)...f(14) 
          = 1 + 7 + 1 + 1·1 + 1·1·1 + 1·1·1·0 + ... + 1·1·1·0·1·0·1 
          = 11. 
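
The fourth formula is a sum of products that counts the run of composites immediately after $p_n$. A minimal sketch follows, assuming that the predicate f(x) may be evaluated by checking whether x divides ((x - 1)!)^2, with the factorial reduced modulo x; this yields the same 0/1 value as the cosine expression for x ≥ 2. The names `is_composite` and `next_prime` are hypothetical. 

```c
#include <assert.h>

/* f(x), the predicate "x is composite" for x >= 2:
   ((x-1)!)^2 / x is an integer exactly when x is composite. */
int is_composite(unsigned x) {
    assert(x >= 2);
    unsigned long long fact = 1;          /* (x-1)! mod x */
    for (unsigned i = 2; i < x; i++)
        fact = fact * i % x;
    return fact * fact % x == 0;
}

/* p_{n+1} = 1 + p_n + sum_{i=1}^{p_n} prod_{j=1}^{i} f(p_n + j). */
unsigned next_prime(unsigned p) {
    assert(p >= 2);
    unsigned sum = 0;
    for (unsigned i = 1; i <= p; i++) {
        unsigned prod = 1;
        for (unsigned j = 1; j <= i; j++)
            prod *= (unsigned)is_composite(p + j);
        sum += prod;                      /* 1 while p+1 .. p+i are all composite */
    }
    return 1 + p + sum;
}
```

The upper limit $p_n$ on the outer sum is enough because there is always a prime between $p_n$ and $2p_n$, so the products are all 0 well before the sum ends. 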


16-3 Wormell's Formula 


C. P. Wormell [Wor] improves on Willans's formulas by avoiding both 

trigonometric functions and the floor function. Wormell's formula can 

in principle be evaluated by a simple computer program that uses only 
integer arithmetic. The derivation does not use Wilson's theorem. 


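Wormell starts with, for x ≥ 2, 

$$B(x) = \prod_{a=2}^{x} \prod_{b=2}^{x} (x - ab)^2 = \begin{cases} \text{a positive integer,} & \text{if } x \text{ is prime,} \\ 0, & \text{if } x \text{ is composite.} \end{cases}$$

Thus, the number of primes ≤ m is given by 

$$\pi(m) = \sum_{x=2}^{m} \frac{1 + (-1)^{2^{B(x)}}}{2},$$

because the summand is the predicate "x is prime." 

Observe that, for n ≥ 1, a ≥ 0, 

$$\prod_{r=1}^{n} (1 - r + a)^2 = \begin{cases} 0, & \text{when } a < n, \\ \text{a positive integer,} & \text{when } a \ge n. \end{cases}$$

Repeating a trick above, the predicate a < n is 

$$(a < n) = \frac{1 - (-1)^{2^{\prod_{r=1}^{n} (1 - r + a)^2}}}{2}.$$

Because 

$$p_n = 2 + \sum_{m=2}^{2^n} (\pi(m) < n),$$

we have, upon factoring constants out of summations, 

$$p_n = \frac{3}{2} + 2^{n-1} - \frac{1}{2} \sum_{m=2}^{2^n} (-1)^{2^{\prod_{r=1}^{n} \left( 1 - r + \frac{m-1}{2} + \frac{1}{2} \sum_{x=2}^{m} (-1)^{2^{B(x)}} \right)^2}},$$

where B(x) stands for the double product defined above. 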


As promised, Wormell's formula does not use trigonometric functions. 
However, as he points out, if the powers of -1 were expanded using 
$(-1)^n = \cos \pi n$, they would reappear. 
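
Wormell's construction can indeed be carried out with integer arithmetic alone. The sketch below makes one simplifying assumption: since $(-1)^{2^B}$ depends only on whether B is zero, the code records whether any factor of each product vanishes rather than forming the (enormous) products themselves. Function names are hypothetical, and n is capped at 6 to keep the nested loops quick. 

```c
#include <assert.h>

/* 1 if B(x) = prod_{a=2}^{x} prod_{b=2}^{x} (x - a*b)^2 is positive,
   0 if some factor vanishes (i.e., x = a*b, so x is composite). */
int B_is_positive(long x) {
    for (long a = 2; a <= x; a++)
        for (long b = 2; b <= x; b++)
            if (x - a * b == 0)
                return 0;
    return 1;
}

/* Summand (1 + (-1)^(2^B(x))) / 2: the predicate "x is prime". */
int prime_pred(long x) {
    int sign = B_is_positive(x) ? 1 : -1; /* (-1)^(2^B): +1 if B > 0, -1 if B = 0 */
    return (1 + sign) / 2;
}

/* pi(m) = sum of the prime predicate for x = 2..m. */
long prime_count(long m) {
    long s = 0;
    for (long x = 2; x <= m; x++)
        s += prime_pred(x);
    return s;
}

/* (a < n) = (1 - (-1)^(2^P)) / 2 with P = prod_{r=1}^{n} (1 - r + a)^2. */
int less_than(long a, long n) {
    int has_zero_factor = 0;
    for (long r = 1; r <= n; r++)
        if (1 - r + a == 0)
            has_zero_factor = 1;
    int sign = has_zero_factor ? -1 : 1;
    return (1 - sign) / 2;
}

/* p_n = 2 + sum_{m=2}^{2^n} (pi(m) < n). */
long nth_prime_wormell(long n) {
    assert(n >= 1 && n <= 6);
    long p = 2;
    for (long m = 2; m <= (1L << n); m++)
        p += less_than(prime_count(m), n);
    return p;
}
```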


16-4 Formulas for Other Difficult Functions 


Let us have a closer look at what Willans and Wormell have done. We 
postulate the rules below as defining what we mean by the class of 
functions that can be represented by "formulas," which we will call 
"formula functions." Here, x is shorthand for $x_1, x_2, \ldots, x_n$ for 
any n ≥ 1. The domain of values is the integers ... -2, -1, 0, 1, 2, .... 


1. The constants ... -1, 0, 1, ... are formula functions. 


2. The projection functions $f(x) = x_i$, for 1 ≤ i ≤ n, are 
formula functions. 

3. The expressions x + y, x - y, and xy are formula functions, if 
x and y are. 

4. The class of formula functions is closed under composition 
(substitution). That is, $f(g_1(x), g_2(x), \ldots, g_m(x))$ is a formula 
function if f and $g_i$ are, for i = 1, ..., m. 

5. Bounded sums and products, written 

$${\sum_{i=a(x)}^{b(x)}}' f(i, x) \quad \text{and} \quad {\prod_{i=a(x)}^{b(x)}}' f(i, x),$$

are formula functions, if a, b, and f are, and a(x) ≤ b(x). 


Sums and products are required to be bounded to preserve the 
computational character of formulas; that is, formulas can be 
evaluated by plugging in values for the arguments and carrying out a 
finite number of calculations. The reason for the prime on the Σ and Π 
is explained later in this chapter. 


When forming new formula functions using composition, we supply 
parentheses when necessary according to well-established 
conventions. 


Notice that division is not included in the list above; that's too 
complicated to be uncritically accepted as a "formula function." Even 
so, the above list is not minimal. It might be fun to find a minimal 
starting point, but we won't dwell on that here. 


This definition of "formula function" is close to the definition of 
"elementary function" given in [Cut]. However, the domain of values 
used in [Cut] is the non-negative integers (as is usual in recursive 
function theory). Also, [Cut] requires the bounds on the iterated sum 
and product to be 0 and x - 1 (where x is a variable), and allows the 
range to be vacuous (in which case the sum is defined as 0 and the 
product is defined as 1). 


In what follows, we show that the class of formula functions is quite 
extensive, including most of the functions ordinarily encountered in 
mathematics. But it doesn't include every function that is easy to 
define and has an elementary character. 


Our development is slightly encumbered, compared to similar 
developments in recursive function theory, because here variables can 
take on negative values. However, the possibility of a value's being 
negative can often be accommodated by simply squaring some 
expression that would otherwise appear in the first power. Our 
insistence that iterated sums and products not be vacuous is another 
slight encumbrance. 


Here, a "predicate" is simply a 0/1-valued function, whereas in 
recursive function theory a predicate is a true/false-valued function, 
and every predicate has an associated "characteristic function" that is 
0/1-valued. We associate 1 with true and 0 with false, as is universally 
done in programming languages and in computers (in what their and 
and or instructions do); in logic and recursive function theory, the 
association is often the opposite. 


The following are formula functions: 
1. $a^2 = aa$, $a^3 = aaa$, and so on. 

2. The predicate a = b: 

$$(a = b) = {\prod_{j=0}^{(a-b)^2}}' (1 - j).$$

3. (a ≠ b) = 1 - (a = b). 

4. The predicate a ≥ b: 

$$(a \ge b) = {\sum_{i=0}^{(a-b)^2}}' ((a - b) = i) = {\sum_{i=0}^{(a-b)^2}}' \; {\prod_{j=0}^{((a-b)-i)^2}}' (1 - j).$$


We can now explain why we do not use the convention that a vacuous 
iterated sum/product has the value 0/1. If we did, we would have 
such shams as 


$$(a = b) = \sum_{i=0}^{-(a-b)^2} 1 \quad \text{and} \quad (a \ge b) = \prod_{i=a}^{b-1} 0.$$


The comparison predicates are key to everything that follows, and we 
don't wish to have them based on anything quite that artificial. 


5. (a > b) = (a ≥ b + 1). 

6. (a ≤ b) = (b ≥ a). 

7. (a < b) = (b > a). 

8. |a| = (2(a ≥ 0) - 1)a. 

9. max(a, b) = (a ≥ b)(a - b) + b. 

10. min(a, b) = (a ≥ b)(b - a) + a. 


Now we can fix the iterated sums and products so that they give the 
conventional and useful result when the range is vacuous. 


11. $\sum_{i=a(x)}^{b(x)} f(i, x) = (b(x) \ge a(x)) \, {\sum_{i=a(x)}^{\max(a(x),\, b(x))}}' f(i, x).$ 

12. $\prod_{i=a(x)}^{b(x)} f(i, x) = 1 + (b(x) \ge a(x)) \left( -1 + {\prod_{i=a(x)}^{\max(a(x),\, b(x))}}' f(i, x) \right).$ 

From now on we will use Σ and Π without the prime. All functions 
thus defined are total (defined for all values of the arguments). 


13. $n! = \prod_{i=1}^{n} i.$ 

This gives n! = 1 for n ≤ 0. 

In what follows, P and Q denote predicates. 

14. ¬P(x) = 1 - P(x). 

15. P(x) & Q(x) = P(x)Q(x). 

16. P(x) | Q(x) = 1 - (1 - P(x))(1 - Q(x)). 

17. P(x) ⊕ Q(x) = $(P(x) - Q(x))^2$. 

18. if P(x) then f(y) else g(z) = P(x)f(y) + (1 - P(x))g(z). 

19. $a^n$ = if n ≥ 0 then $\prod_{i=1}^{n} a$ else 0. 

This gives, arbitrarily and perhaps incorrectly for a few cases, 
the result 0 for n < 0, and the result 1 for $0^0$. 

20. $(m \le \forall x \le n)P(x, y) = \prod_{x=m}^{n} P(x, y).$ 

21. $(m \le \exists x \le n)P(x, y) = 1 - \prod_{x=m}^{n} (1 - P(x, y)).$ 

∀ is vacuously true, ∃ is vacuously false. 

22. $(m \le \min x \le n)P(x, y) = m + \sum_{i=m}^{n} \prod_{j=m}^{i} (1 - P(j, y)).$ 

The value of this expression is the least x in the range m to n 
such that the predicate is true, or m if the range is vacuous, or 
n + 1 if the predicate is false throughout the (nonvacuous) 
range. The operation is called "bounded minimalization" and it 
is a very powerful tool for developing new formula functions. 
It is a sort of functional inverse, as illustrated by the next 
formula. That minimalization can be done by a sum of 
products is due to Goodstein [Good]. 

23. $\lfloor \sqrt{n} \rfloor = (0 \le \min k \le |n|)((k + 1)^2 > n).$ 

This is the "integer square root" function, which we define to 
be 0 for n < 0, just to make it a total function. 

24. $d|n = (-|n| \le \exists q \le |n|)(n = qd).$ 

This is the "d divides n" predicate, according to which 0|0 but 
¬(0|n) for n ≠ 0. 

25. n ÷ d = if n ≥ 0 then $(-n \le \min q \le n)(0 \le \exists r \le |d| - 1)(n = qd + r)$ 
else $(n \le \min q \le -n)(-|d| + 1 \le \exists r \le 0)(n = qd + r).$ 

This is the conventional truncating form of integer division. 
For d = 0 it gives a result of |n| + 1, arbitrarily. 

26. rem(n, d) = n - (n ÷ d)d. 

This is the conventional remainder function. If rem(n, d) is 
nonzero, it has the sign of the numerator n. If d = 0, the 
remainder is n. 

27. isprime(n) = n ≥ 2 & ¬$(2 \le \exists d \le |n| - 1)(d|n).$ 

28. $\pi(n) = \sum_{i=1}^{n} \mathrm{isprime}(i).$ 

(Number of primes ≤ n.) 

29. $p_n = (1 \le \min k \le 2^n)(\pi(k) = n).$ 

30. exponent(p, n) = $(0 \le \min x \le |n|)$ ¬$(p^{x+1} | n).$ 

This is the exponent of a given prime factor p of n, for n ≥ 1. 


31. For n ≥ 0: $2^n = \prod_{i=1}^{n} 2$, $2^{2^n} = \prod_{i=1}^{2^n} 2$, $2^{2^{2^n}} = \prod_{i=1}^{2^{2^n}} 2$, etc. 

32. The nth digit after the decimal point in the decimal expansion 
of $\sqrt{2}$: $\mathrm{rem}(\lfloor \sqrt{2 \cdot 10^{2n}} \rfloor, 10).$ 
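
Several of the constructions in this list can be mimicked in code by restricting oneself to additions, subtractions, multiplications, and bounded loops. The following is a sketch under that self-imposed restriction, not an efficient implementation: `eq_pred` and `ge_pred` follow the product and sum forms of the predicates a = b and a ≥ b, `bounded_min` follows the sum-of-products form of bounded minimalization, and `isqrt_formula` applies it to the integer square root. All names, and the function-pointer type `pred_fn`, are hypothetical. 

```c
#include <assert.h>

typedef int (*pred_fn)(long x, long y);   /* a 0/1-valued predicate P(x, y) */

/* (a = b) as the product of (1 - j) for j = 0 .. (a-b)^2. */
int eq_pred(long a, long b) {
    long bound = (a - b) * (a - b);
    long prod = 1;
    for (long j = 0; j <= bound; j++)
        prod *= 1 - j;                    /* becomes 0 as soon as j reaches 1 */
    return (int)prod;
}

/* (a >= b) as the sum over i = 0 .. (a-b)^2 of the predicate (a - b) = i. */
int ge_pred(long a, long b) {
    long bound = (a - b) * (a - b);
    int sum = 0;
    for (long i = 0; i <= bound; i++)
        sum += eq_pred(a - b, i);
    return sum;
}

/* (m <= min x <= n) P(x, y) = m + sum_{i=m}^{n} prod_{j=m}^{i} (1 - P(j, y)). */
long bounded_min(long m, long n, pred_fn P, long y) {
    assert(P != 0);
    long sum = 0;
    for (long i = m; i <= n; i++) {
        long prod = 1;
        for (long j = m; j <= i; j++)
            prod *= 1 - P(j, y);
        sum += prod;                      /* 1 while P is false on m..i */
    }
    return m + sum;
}

/* The predicate (k + 1)^2 > n used by the integer square root. */
int next_square_exceeds(long k, long n) {
    return (k + 1) * (k + 1) > n;
}

/* floor(sqrt(n)) = (0 <= min k <= |n|)((k + 1)^2 > n); 0 for n < 0. */
long isqrt_formula(long n) {
    long abs_n = n < 0 ? -n : n;
    return bounded_min(0, abs_n, next_square_exceeds, n);
}
```

The two boundary behaviors described for bounded minimalization show up directly: a vacuous range returns m, and a predicate that is false throughout returns n + 1. 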


Thus, the class of formula functions is quite large. It is limited, 
though, by the following theorem (at least): 


Theorem. If f is a formula function, then there is a constant k such 
that 

$$|f(x)| \le 2^{2^{\cdot^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}},$$

where there are k 2's. 


This can be proved by showing that each application of one of the 
rules 1-5 (on page 279) preserves the theorem. For example, if f(x) = 
c (rule 1), then for some h, 


AX) S22" 4h, 


where there are h 2's. Therefore, 


rajaja 


f(x) 32° h+2, 
because $\max(|x_1|, \ldots, |x_n|) \ge 0$. For 
$f(x) = x_i$ (rule 2), $|f(x)| \le \max(|x_1|, \ldots, |x_n|)$, so the 
theorem holds with k = 0. 


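For rule 3, let 

$$|f(x)| \le \underbrace{2^{2^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}}_{k_1\ 2\text{'s}} \quad \text{and} \quad |g(x)| \le \underbrace{2^{2^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}}_{k_2\ 2\text{'s}}.$$

Then, clearly 

$$|f(x) \pm g(x)| \le 2 \cdot \underbrace{2^{2^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}}_{\max(k_1, k_2)\ 2\text{'s}} \le \underbrace{2^{2^{\cdot^{\cdot^{2^{\max(|x_1|, \ldots, |x_n|)}}}}}}_{\max(k_1, k_2) + 1\ 2\text{'s}}.$$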


Similarly, it can be shown that the theorem holds for f(x, y) = xy. 


The proofs that rules 4 and 5 preserve the theorem are a bit tedious 
but not difficult, and are omitted. 


From the theorem, it follows that the function 


Equation 4 

$$f(x) = \underbrace{2^{2^{\cdot^{\cdot^{\cdot^{2^{x}}}}}}}_{x\ 2\text{'s}}$$

is not a formula function, because for sufficiently large x, Equation (4) 
exceeds the value of the same expression with any fixed number k of 
2's. 


For those interested in recursive function theory, we point out that 
Equation (4) is primitive recursive. Furthermore, it is easy to show 
directly from the definition of primitive recursion that formula 
functions are primitive recursive. Therefore, the class of formula 
functions is a proper subset of the primitive recursive functions. The 
interested reader is referred to [Cut]. 


In summary, this section shows that not only is there a formula in 
elementary functions for the nth prime, but also for a good many 
other functions encountered in mathematics. Furthermore, our 
"formula functions" are not based on trigonometric functions, the 
floor function, absolute value, powers of -1, or even division. The only 
sneaky maneuver is to use the fact that the product of a lot of 
numbers is 0 if any one of them is 0, which is used in the formula for 
the predicate a = b. 


It is true, however, that once you see them, they are not interesting. 
The quest for "interesting" formulas for primes should go on. For 
example, [Rib] cites the amazing theorem of W. H. Mills (1947) that 
there exists a θ such that the expression 

$$\lfloor \theta^{3^n} \rfloor$$

is prime-valued for all n ≥ 1. Actually, there are an infinite number of 
such values (e.g., 1.3063778838+ and 1.4537508625483+). 

Furthermore, there is nothing special about the "3"; the theorem is 
true if the 3 is replaced with any integer ≥ 3 (for different values of 
θ). Better yet, the 3 can be replaced with 2 if it is true that there is 
always a prime between $n^2$ and $(n + 1)^2$, which is almost certainly 
true, but has never been proved. And furthermore, ... well, the 
interested reader is referred to [Rib] and to [Dud] for more fascinating 
formulas of this type. 


Appendix A. Arithmetic Tables for a 4-Bit Machine 


In the tables in Appendix A, underlining denotes signed overflow. 


Table A-1. Addition 


The table for subtraction (Table A-2) assumes that the carry bit for a - 
b is set as it would be for $a + \bar{b} + 1$, so that carry is equivalent to 
"not borrow." 


Table A-2. Subtraction (row - column) 


For multiplication (Tables A-3 and A-4), overflow means that the 
result cannot be expressed as a 4-bit quantity. For signed 
multiplication (Table A-3), this is equivalent to the first five bits of the 
8-bit result not being all 1's or all 0's. 


Table A-3. Signed Multiplication 


Table A-4. Unsigned Multiplication 


Tables A-5 and A-6 are for conventional truncating division. Table A-5 
shows a result of 8 with overflow for the case of the maximum 
negative number divided by -1, but on most machines the result in 
this case is undefined, or the operation is suppressed. 


Table A-5. Signed Short Division (row ÷ column) 


Table A-6. Unsigned Short Division (row ÷ column) 


Tables A-7 and A-8 give the remainder associated with conventional 
truncating division. Table A-7 shows a result of 0 for the case of the 
maximum negative number divided by -1, but on most machines the 
result for this case is undefined, or the operation is suppressed. 


Table A-7. Remainder for Signed Short Division (row ÷ column) 


Table A-8. Remainder for Unsigned Short Division (row ÷ column) 


Appendix B. Newton's Method 


To review Newton's method very briefly, we are given a differentiable 
function f of a real variable x and we wish to solve the equation f(x) 
= 0 for x. Given a current estimate $x_n$ of a root of f, Newton's method 
gives us a better estimate $x_{n+1}$, under suitable conditions, according 
to the formula 

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

Here, $f'(x_n)$ is the derivative of f at $x = x_n$. The derivation of this 
formula can be read off the figure below (solve for $x_{n+1}$). 

[Figure: the curve y = f(x) and its tangent line at $x_n$, whose slope is 
$f'(x_n) = f(x_n)/(x_n - x_{n+1})$; the tangent crosses the x-axis at $x_{n+1}$.] 


The method works very well for simple, well-behaved functions such 
as polynomials, provided the first estimate is quite close. Once an 
estimate is sufficiently close, the method converges quadratically. 
That is, if r is the exact value of the root, and $x_n$ is a sufficiently close 
estimate, then 

$$|x_{n+1} - r| \le (x_n - r)^2.$$

Thus, the number of digits of accuracy doubles with each iteration 
(e.g., if $|x_n - r| \le 0.001$, then $|x_{n+1} - r| \le 0.000001$). 


If the first estimate is way off, then the iterations may converge very 
slowly, may diverge to infinity, may converge to a root other than the 
one closest to the first estimate, or may loop among certain values 
indefinitely. 


This discussion has been quite vague because of phrases like "suitable 
conditions," "well-behaved," and "sufficiently close." For a more precise 
discussion, consult almost any first-year calculus textbook. 


In spite of the caveats surrounding this method, it is occasionally 
useful in the domain of integers. To see whether or not the method 
applies to a particular function, you have to work it out, such as is 
done in Section 11-1, "Integer Square Root," on page 203. 
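
As an illustration of Newton's method in the domain of integers, here is a minimal sketch of an integer square root based on the iteration $x_{n+1} = \lfloor (x_n + \lfloor a/x_n \rfloor)/2 \rfloor$. It is not claimed to be the routine of Section 11-1; in particular, the crude first estimate $x_0 = a$ and the function name `isqrt_newton` are assumptions made here for simplicity. Starting at or above the true root, the estimates decrease until they stop decreasing, at which point the current estimate is the floor of the square root. 

```c
#include <assert.h>
#include <stdint.h>

/* floor(sqrt(a)) by Newton's method on f(x) = x^2 - a, in integers. */
uint32_t isqrt_newton(uint32_t a) {
    if (a < 2)
        return a;
    uint64_t x = a;                       /* first estimate, >= sqrt(a) */
    uint64_t y = (x + a / x) / 2;         /* next Newton estimate */
    while (y < x) {                       /* iterate while still decreasing */
        x = y;
        y = (x + a / x) / 2;
    }
    assert(x * x <= a && (x + 1) * (x + 1) > a);
    return (uint32_t)x;
}
```

The 64-bit intermediates avoid overflow in `x + a / x` when a is close to $2^{32}$. A closer first estimate would cut the number of iterations considerably. 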


Table B-1 gives a few iterative formulas derived from Newton's 
method, for computing certain numbers. The first column shows the 
number it is desired to compute. The second column shows a function 
that has that number as a root. The third column shows the right-hand 
side of Newton's formula corresponding to that function. 


It is not always easy, incidentally, to find a good function to use. 
There are, of course, many functions that have the desired quantity as 
a root, and only a few of them lead to a useful iterative formula. 
Usually, the function to use is a sort of inverse of the desired 


computation. For example, to find $\sqrt{a}$, use $f(x) = x^2 - a$; to find 
$\log_2 a$, use $f(x) = 2^x - a$, and so on. [1] 


[1] Newton's method for the special case of the square root function 
was known to Babylonians about 4,000 years ago. 


Table B-1. Newton's Method for Computing Certain Numbers 


Quantity to Be Computed 


$x_n(2 - a x_n)$ 


The iterative formula for $\log_2 a$ converges (to $\log_2 a$) even if the 
multiplier 1/ln 2 is altered somewhat (for example, to 1, or to 2). 
However, it then converges more slowly. A value of 3/2 or 23/16 
might be useful in some applications (1/ln 2 ≈ 1.4427). 
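
As a worked check, the iterative formulas for the two functions named above can be derived directly from Newton's formula $x_{n+1} = x_n - f(x_n)/f'(x_n)$. The derivation below is supplied here for illustration rather than copied from the table; it shows where the multiplier 1/ln 2 comes from. 

```latex
\begin{align*}
f(x) = x^2 - a,\ f'(x) = 2x:\quad
  & x_{n+1} = x_n - \frac{x_n^2 - a}{2x_n}
            = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right), \\
f(x) = 2^x - a,\ f'(x) = 2^x \ln 2:\quad
  & x_{n+1} = x_n - \frac{2^{x_n} - a}{2^{x_n}\ln 2}
            = x_n + \frac{1}{\ln 2}\left(\frac{a}{2^{x_n}} - 1\right).
\end{align*}
```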